1. Executive Summary


This document constitutes the Final Report (CDRL A004) for RADC Contract No. F30602-81-C-0297,
Decentralized System Control. This report covers the period November 19, 1981 through March 31, 1984. It
describes  the  results  of  our  proposed  work,  which  was  performed  by  the  Archons  project  during  the  contract 
period.  It  focuses  on  the  studies  of  the  fundamental  issues  of  decentralized  system  control  ranging  from 
investigations  of  decentralized  resource  management  principles  to  architectural  support  for  decentralized 
operating  systems.  The  report  also  includes  a  development  plan  of  the  decentralized  Archons  operating 
system (OS), called ArchOS. A simulation environment for decentralized algorithms, called DATE,
and the current status of the Archons interim testbed are also described.

1.1  Archons  and  ArchOS  Objectives 

The  Archons  project  is  performing  research  on  decentralized  management  of  operating  system  level 
resources globally for an entire computer in which physical dispersal causes variable and unknown
communication delays. We are interested in a very specific form of resource management decentralization:
decisions  are  made  by  a  team  of  equals  who  negotiate,  compromise,  and  reach  a  consensus  -  the  objectives 
are  improved  robustness  and  modularity  compared  with  conventional  unilateral  resource  management. 
Making decisions in this way, despite inaccurate and incomplete information about nonlocal state, involves
nondeterministic computations. The scope of this management encompasses the operating system resources of all
physical  nodes  in  the  computer,  unlike  a  network  which  has  communicating  local  operating  systems.  Failure 
atomicity  requires  a  transaction  facility  in  the  OS  kernel,  but  the  usual  serialization  model  of  data  consistency 
is  insufficient  for  OS  use  -  so,  we  have  supplemented  it  with  a  relational  one.  The  abstract  types  to  be 
managed  within  a  decentralized  OS  are  different  from  the  objects  found  in  traditional  databases,  requiring 
innovative transaction techniques.

We  are  also  interested  in  the  architectural  implications  of  our  unique  approach  to  operating  systems:  they 
arise  in  the  interconnection  structure  and  in  the  processor.  Consequently,  we  believe  that  each  node  of  the 
computer  ought  to  consist  of  an  application  subsystem  and  an  OS  subsystem.  The  former  may  be  arbitrary 
and  heterogeneous  but  we  are  designing  the  latter  ourselves.  The  OS  machine  (named  Meta)  at  each  node  is 
an unusual functionally-oriented multiprocessor having an extremely malleable architecture to accommodate
whatever OS support mechanisms are desired. In addition, the hardware/software implementation tradeoffs
are  transparent  to  the  OS  programmer.  There  is  a  substantial  experimental  component  in  the  Archons 
research,  and  the  initial  experiment  is  performed  by  the  Archons  interim  testbed  which  consists  of  a  set  of  Sun 
workstations  interconnected  by  an  Ethernet. 


The  objectives  of  the  Archons  project  in  general,  and  of  its  ArchOS  operating  system  portion  in  particular, 
differ significantly in a number of ways from those of the other distributed system and distributed/network
operating system efforts we are aware of.

The  foremost  of  these  dissimilarities  has  to  do  with  our  concentration  on  the  special  case  of  "distribution" 
which  we  term  decentralization  (explained  further  in  Chapter  2): 

•  to  explore  the  fundamental  nature  of  making  and  carrying  out  decisions  in  a  highly  decentralized 
fashion 

o  for  resource  management  in  general, 
o but for operating systems in particular;

•  and  thereby  to  facilitate  the  creation  of 

o  substantively  improved  computer  systems  in  general  (including  uniprocessors  and  computer 
networks), 

o  but  especially  a  novel  decentralized  computer  which  can  be  physically  dispersed  yet  which 
exhibits  the  optimality  of  executive  level  global  resource  management  hitherto  confined  to 
physically concentrated (and highly centralized) uni- and multiprocessor computers.

Our  principles  of  decentralized  decision  making  and  resource  management  have  wide  applicability,  from 
integrated  man-machine  systems,  through  application  software,  operating  systems,  and  down  to  machine 
hardware.  However,  we  are  focusing  on  the  OS  levels  (and  below)  for  three  important  reasons: 

•  the  OS  is  a  constant  beneath  many  changing  applications  -  this  provides  generality  and  lowers 
system  costs  by  solving  resource  management  problems  once  instead  of  leaving  them  to  be  solved 
repeatedly by the users;

• the degree and cost of successful decentralization above the OS depends on success at the OS
levels  and  below; 

•  OS  problems  are  almost  always  the  most  general,  complex,  and  dynamic  (elsewhere  in  a  system 
the  resources  are  usually  more  dedicated)  -  consequently,  solutions  at  the  OS  levels  are  more 
likely  to  be  amenable  for  use  at  higher  or  lower  levels,  while  the  converse  is  much  less  likely. 

Any  system  can  be  expected  to  have  resources  local  to  each  node,  but  we  are  disregarding  them  in  our 
research  since  managing  them  is  so  well  understood  by  comparison  with  managing  global  resources. 


1.2 Summary of This Period's Tasks

This  section  provides  a  very  brief  summary  of  each  task  described  in  this  period’s  final  report  -  such 
information  as: 

•  objective 

•  role  in  the  overall  Archons  project  plan 

•  research  contribution 

•  current  direction 

•  problems 

•  accomplishments 

•  future  expectations. 

1.2.1  Decentralized  Resource  Management  Principles 

Section  12  is  devoted  to  the  issue  of  seeking  a  new  resource  management  paradigm  which  is  intrinsically 
decentralized without the historical centralized biases and artifacts. This most fundamental issue underlies the
entire Archons project. It appears to be philosophical and qualitative because the conceptual shift is dramatic
and not yet amenable to analytical formulation. It also runs against the prevailing tide of intellectual inertia
and  career  investment.  Like  many  positions  which  espouse  new  philosophies,  methodologies,  etc.,  it  can  be 
best  appreciated  by  those  who  have  substantial  experience  with  the  older  alternatives,  the  resulting  problems, 
poor  solutions,  and  attempted  better  solutions. 

We are concerned with both physical and logical decentralization.

Physical decentralization does not mean simply spatial dispersal of nodes as in computer networks. Rather,
it addresses the objective of knitting a collection of spatially dispersed nodes into a single computer. This
requires  a  completely  new  kind  of  operating  system:  one  which  has  physical  distance  inside  it.  This  results  in 
variable  and  unknown  communication  delays  which  are  significant  with  respect  to  the  rate  of  system  state 
change.  New  concepts  and  techniques  for  resource  management  are  called  for  -  these  include  in  particular 

• replacement of the "garbage-in/garbage-out" point of view with a "do the best with what you can
get" one (i.e., accommodating and even taking advantage of indeterminism);

• replacement of the "processing-oriented" point of view with a "data-oriented" one, in which the
principal goals are maintaining the consistency of data objects, and the correctness of the actions
carried out on those objects.


Logical decentralization in our case is interpreted as being in a region diagonally opposed to the origin
(representing maximal centralization) of a 7-dimensional space of resource management. The initial version
of this model was developed on an earlier contract. The degree of logical decentralization we aspire to
involves each decision being made by negotiation, consensus, and compromise among team members, each of
whom  is  equal  in  authority  and  responsibility  but  has  a  different  perspective  due  to  being  resident  on  a 
different  node  of  the  computer. 
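
How such a team of equals might converge on one decision can be illustrated with DeGroot's iterated averaging, the formal consensus method cited later in this report [DeGroot 74]. The sketch below is only an illustration under assumed inputs, not part of ArchOS: the function name, the weights, and the initial estimates are all hypothetical.

```python
def degroot_consensus(weights, estimates, tol=1e-9, max_rounds=10_000):
    """Iterate DeGroot averaging until all members hold (nearly) the same estimate.

    weights[i][j] is the (hypothetical) weight member i gives to member j's
    estimate; every row must sum to 1.
    """
    n = len(estimates)
    for row in weights:
        if len(row) != n or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each weight row must have n entries summing to 1")
    current = list(estimates)
    for round_no in range(1, max_rounds + 1):
        # Each member replaces its estimate with a weighted average of all estimates.
        current = [sum(w * x for w, x in zip(row, current)) for row in weights]
        if max(current) - min(current) < tol:
            return current[0], round_no
    raise RuntimeError("estimates did not converge; check the weight matrix")


# Three equal peers, each starting from a different local view (assumed values).
equal_peers = [[1 / 3, 1 / 3, 1 / 3]] * 3
value, rounds = degroot_consensus(equal_peers, [0.2, 0.5, 0.8])
```

With equal weights the members agree after a single round on the mean of their initial estimates. Unequal rows would model members whose views count for more, which moves away from the peer relationship described above.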

This  task  has  reached  a  plateau:  the  first  order  technology  requirements  are  clear,  and  are  being  pursued  in 
other tasks. Foremost among those are an information-theoretic technique based on team decision theory, and
a complementary heuristic effort: both need a resource management context, and we have selected assigning
processes to processors (a context which is valuable from a system point of view as well). Feedback from
those is causing the philosophical framework to be updated and further developed.

1.2.2  ArchOS 

ArchOS is the name of the Archons project's initial decentralized global operating system. In the preceding
years of our research, we have focused on philosophy and technology, while keeping an informal vision of the
Archons system and ArchOS operating system in mind. That philosophy and technology are now mature
enough for us to begin solidifying that vision. This will be one of our major activities during the follow-on
contract periods, beginning with a document defining our goals and objectives, together with the
requirements, for ArchOS. That will be followed by a functional specification, design, and implementation (on an
interim testbed of Suns). Section 12 summarizes the major objectives of ArchOS, and Section 13 discusses
our development plan for ArchOS.

1.2.3  Predominant  OS  Functions 

This  effort  is  intended  to  identify  the  OS  functions  most  in  need  of  architectural  support:  those  which 
consume  many  processor  cycles  as  a  result  of  their  complexity,  frequency  of  execution,  or  fast  response  time. 
Ideally,  this  task  would  be  focused  on  decentralized  operating  systems  such  as  ArchOS.  Obviously  this  cannot 
be done at present since ArchOS is not far enough along, so a first approximation is being based on
centralized operating systems: this will at least help formulate performance benchmarks for the architecture studies
discussed  in  Chapter  6.  We  have  recently  de-emphasized  this  effort  due  to  the  greater  need  for  ArchOS  design 
manpower,  and  to  the  difficulty  of  acquiring  the  intended  information  about  other  operating  systems. 

1.2.4 Transactions


We  have  two  different  tasks  under  way  on  atomic  transactions. 

• One, described in Section 3.1, is oriented toward decentralization and programming modularity.
To the best of our knowledge it is unique research in that it utilizes only transaction syntax
as a basis for consistency of nonserializable transactions. The concepts in this phase of the effort
have largely stabilized, and most of the work has been formal - since the notions of data
consistency and transaction correctness underlie our entire system approach, it is incumbent upon us to
provide a theoretical foundation to ensure their validity. We were first able to analytically prove
that no alternative approach using semantic information could provide better concurrency of
actions; then we proved that none could even do as well. The next stage of this task is to deal with
fault recovery.


• Of course theoretical research necessarily makes simplifying assumptions; in actual systems, complex
and informal tradeoffs are required. For this reason, we have a second transaction task taking
a different tack, as shown in Section 22. It is less decentralized in that it uses global semantic
information, aligning it with the few other research efforts we know of in the area of nonserializable
transactions. It concentrates more on complex abstract data types, and has already made fault
recovery a major theme. Progress here is slow but steady.


1.2.5  Interprocess  Communication 

IPC is an essential element in a distributed system, and even more so in our style of decentralization. We
expect to experiment with a variety of new IPC concepts and facilities for ArchOS in order to obtain the
benefits we seek. This necessitates a design methodology which permits rapid, easy redesign and
reimplementation. Policy/mechanism separation has for some time been considered in this regard, and has been
attempted in limited OS contexts such as process scheduling. However, we believe that it has not been
approached properly, and consequently it has been far less than successful. Policy/mechanism separation has
never even been attempted in IPC, which is far more complex than, for example, processor scheduling or
memory management. We expect to make important contributions to both the fields of programming
abstractions and interprocess communication. Some of the biggest intellectual hurdles have been overcome, and
constant progress is assured through the duration of the next contract period.
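
The general idea of policy/mechanism separation, applied to message delivery, can be sketched as follows. This is a generic illustration and not the ArchOS IPC design, which this summary does not specify; the `Mailbox` class and both policy functions are hypothetical names introduced only for the example.

```python
from typing import Callable, List

# A policy only chooses which pending message to deliver next.
Policy = Callable[[List[dict]], int]


def fifo_policy(pending: List[dict]) -> int:
    """Deliver the oldest message first."""
    return 0


def priority_policy(pending: List[dict]) -> int:
    """Deliver the highest-priority message; ties go to the oldest."""
    return max(range(len(pending)), key=lambda i: (pending[i]["priority"], -i))


class Mailbox:
    """Mechanism: stores and hands over messages, with no ordering rule of its own."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy
        self._pending: List[dict] = []

    def send(self, body: str, priority: int = 0) -> None:
        self._pending.append({"body": body, "priority": priority})

    def receive(self) -> str:
        if not self._pending:
            raise LookupError("mailbox is empty")
        # The mechanism asks the policy for an index and does nothing else.
        return self._pending.pop(self._policy(self._pending))["body"]
```

Replacing `fifo_policy` with `priority_policy` changes delivery order without touching `Mailbox`, which is the kind of rapid redesign the separation is meant to permit.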


1.2.6 Decentralized Algorithm Testing Environment

Chapter 5 outlines DATE, a discrete event simulator for performing experiments with decentralized
algorithms on VAX/UNIX. Even with our Sun-based Interim Testbed operational, a proper simulation facility
offers advantages which would be more expensive to achieve on the testbed: e.g., stimulus and
instrumentation mechanisms, reconfigurable topologies, alterable communication subnet characteristics. DATE is now
completed and has also been installed at NOSC in San Diego.
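
To show what a discrete event simulation of a decentralized algorithm involves, here is a minimal sketch. It does not reproduce DATE, whose design is covered in Chapter 5; the function name, the flooding algorithm, the ring topology, and the uniform delay model are all assumptions chosen for illustration.

```python
import heapq
import random


def simulate_flood(topology, source, max_delay, seed=0):
    """Flood one message from `source`; return each node's first-receipt time.

    topology maps a node to the list of its neighbors (a reconfigurable input).
    Each hop takes a random delay in [0, max_delay], standing in for the
    variable, unknown communication delays of a dispersed computer.
    """
    rng = random.Random(seed)          # seeded, so a run can be repeated exactly
    events = [(0.0, 0, source)]        # (arrival time, tie-breaker, destination)
    next_id = 1
    first_receipt = {}
    while events:
        clock, _, node = heapq.heappop(events)   # always process the earliest event
        if node in first_receipt:
            continue                             # duplicate copy: ignore it
        first_receipt[node] = clock
        for neighbor in topology[node]:
            if neighbor not in first_receipt:
                arrival = clock + rng.uniform(0.0, max_delay)
                heapq.heappush(events, (arrival, next_id, neighbor))
                next_id += 1
    return first_receipt


# Hypothetical four-node ring; changing this dict reconfigures the topology.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
times = simulate_flood(ring, source=0, max_delay=5.0, seed=1)
```

The topology and the delay bound are plain parameters, so an experiment can vary them freely, which is harder to arrange on physical hardware.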


1.2.7  Decentralized  Computer  Architecture 

Decentralized  resource  management  should  not  be  considered  merely  an  OS  matter,  but  rather  a  system 
matter.  Our  operating  system  is  sufficiently  unusual  that  it  suggests  reconsidering  the  architecture  of  the 
processors and their interconnection^1.

• We have been looking at two aspects of the processor architecture topic: separating each node of
the computer into an OS part and an application part; and the design of the OS part. This allows
existing machines and application software to be retained, while at the same time the OS
processor can be designed expressly to facilitate decentralized resource management. Concurrency of OS
and application execution also improves system performance.

o  The  first  aspect  began  with  an  extensive  evaluation  of  the  literature  on  architectural  support 
for  operating  systems;  substantive  subsequent  progress  remains  to  be  made  this  academic 
year. 

o  The  second  aspect  began  designing  the  OS  machine  architecture,  but  then  turned  to 
developing  an  improved  methodology  for  doing  so.  So  little  basic  scientific  and  engineering 
perspective  is  normally  used  to  design  and  evaluate  machines  that  we  felt  compelled  to 
intervene  in  that  respect.  Our  desire  was  only  to  do  a  good  job  on  our  own  machine,  but  by 
coincidence the general topic became one of the hottest in computer architecture (i.e., the
"RISC/CISC" controversy). Our methodology contribution had unexpected impact in this
controversy, with side effects on the community's attitudes and perceptions.

This aspect has two major thrusts going forward: separation of the effects of register
structure from those of instruction set complexity; and developing a systematic approach to
functional migration (e.g., from software to microcode or hardware). While these subtasks
will  be  important  as  ends  in  the  field  of  computer  architecture,  to  us  they  are  primarily 
means  to  the  end  of  designing  our  own  OS  machine  named  Meta.  Both  subtasks  will  be 
completed  in  this  academic  year,  allowing  progress  on  Meta  to  resume. 

1.2.8  Interim  Testbed 

The  Archons  project  has  two  testbed  facilities  in  its  plans:  an  interim  one  based  on  commercially  available 
hardware and software; and a later one of our own hardware and software design. The former will support
experimental research with both decentralized algorithms (e.g., process/processor binding, IPC, transactions)
and decentralized operating system structures. This experience will eventually lead to the development of a
testbed which will allow experimental research on not just design but also implementation of both software
and  hardware. 

Chapter 7 provides an overview of the current Interim Testbed system. The main software requirements of
the testbed were having both the Unix operating system and a lower level executive (in our case, BBN's
CMOS). In addition, the hardware was required to employ the Multibus backplane and a 68010 processor.
Thus, we selected the Sun workstation for both technical and administrative reasons. At this point, the
testbed has a stable minimal hardware and operating system configuration, and we have completed the
installation of CMOS on our Sun workstations. The rationale for the system selection and
the current status of the Archons testbed system are included. Our plan for the further development of the
testbed is also described.

^1 This work is sponsored by the U.S. Army Center for Tactical Computer Systems.

2. ArchOS: A Decentralized Operating System

2.1  Overview 

This  chapter  outlines  how  the  decentralized  resource  management  concepts  we  have  been,  and  are,  creating 
can  be  utilized  to  construct  an  operating  system  which  is  radically  different  from  current  practice  in  both 
principle  and  behavior.  This  initial  operating  system  is  named  ArchOS,  and  will  be  an  experimental  existence 
proof.  Not  only  its  objectives,  but  also  a  preliminary  development  plan,  are  described. 

2.2  ArchOS  Objectives 

2.2.1  Background 

Several of the principals in the Archons project have been designing and implementing a variety of
distributed systems, primarily in military/industrial R&D contexts, for as long as 14 years (and continue to do so
as consultants and corporate employees). These systems have been relatively innovative in many respects, yet
as conservative as need be for transfer of the technology to product environments without excessive risk.
Experience with these research prototypes and their progeny products exposed us to many of the critical
problems of physically and logically distributed systems, and to the frustrating limitations of trying to
adequately solve these problems with approaches from conventional "unicentric" computers and "polycentric"
computer networks. As is all too often the case, many of the specific details of these systems and our
experiences remain under corporate proprietary and military classification shrouds. However, they clearly
manifest themselves in the objectives and directions of the Archons project and its ArchOS operating
system effort.

A  university  normally  imposes  few  (but  not  necessarily  no)  "product"  pressure  constraints  -  the  amount 
depends  on:  the  extent  to  which  a  project  is  generating  a  stable  facility  for  users,  versus  research  results  per  se; 
the project sponsor(s), and contractual relationships; etc. Being in an academic environment which is suitably
oriented and equipped for large scale experimental hardware and software research (such as CMU's Computer
Science  Department),  and  having  appropriate  DoD  and  industrial  sponsorship,  we  could  have  chosen  to 
immediately  apply  the  lessons  we  have  learned  from  the  work  of  ourselves  and  others  to  the  design  and 
implementation  of  an  adventurous  distributed  system  and  operating  system  (OS);  we  believe  that  it  would 
have  made  significant  contributions  to  the  field  (and  that  we  would  have  greatly  enjoyed  ourselves). 

Instead,  we  chose  to  embark  on  a  much  more  ambitious,  longer  term,  and  potentially  higher  payoff  research 
effort: 

• first seeking visionary new resource management paradigms which are as intrinsically
"decentralized" as we could conceive of;

• then employing these as the foundation of an experimental decentralized OS;

• and finally utilizing the resulting OS concepts and techniques to perform
hardware/firmware/software implementation tradeoffs which lead to a second generation OS,
and the hardware design of our own optimal architecture for it -

we view decentralized resource management as a system, not just a software, effort [Jensen 81a].

(This raises the issue of whether such a machine is, or should be, a complex, as opposed to a
reduced, instruction set computer: our choice is neither, which we have commented on in [Colwell
83].)

Evolution  is  generally  appropriate  as  the  primary  mode  of  computer  (and  other)  system  development,  but  it 
should be performed with much careful thought. Almost all work on "distributed" systems in general, and
"distributed"/network operating systems in particular, has been evolutionary to an extreme - most of the
resource management concepts have been simple adaptations of centralized ones, burdened by inappropriate
and even counterproductive artifacts. The ineffectiveness of constructing airplanes which fly by flapping their
wings was recognized early; but corresponding realizations about distributed systems have largely not yet
taken place, as we have argued for several years (e.g., [Jensen 76]) and briefly review in Subsection 22.1.

Exploring  new  frontiers,  especially  on  a  large  systems  scale,  can  be  not  only  exciting  and  stimulating,  but 
also frustrating, tiring, and even dangerous. A paramount source of the unpleasantness is the extensive
duration  such  an  effort  can  entail: 

•  a  system  design  is  broad  -  the  number  and  intricacy  of  its  interacting  problems  rises  exponentially 
with  size; 

•  a  particularly  formidable  and  key  problem  may  consume  a  great  deal  of  time; 

•  an  additional  interval  may  elapse  before  it  is  prudent  to  disclose  one's  solution  to  a  particular 
problem. 

One  reason  for  the  disclosure  delay  is  that  an  unconventional  and  perhaps  controversial  result  might  be 
viewed skeptically as more of a conjecture if it is not substantiated by evidence, which can often require
concomitant results or extensive experimentation (neither of which may be complete at that time). In
addition, government funding for building large scale systems is rare and often competitive, most of the
aspirants commonly being corporations; the satisfaction of seeing your ideas appreciated through their
appearance in someone else's proposal may be too high a price to pay for losing a unique opportunity to
implement  and  experiment  with  those  ideas  yourself.  (In  some  of  our  own  cases,  a  certain  degree  of  reticence 
has  doubtless  been  instilled  as  well  by  extensive  careers  in  military/industrial  research  where  the  incentive  to 
publish  ranges  from  marginally  positive  to  strongly  negative.) 

Partially as a consequence of these delay effects, social pressures may come to bear - one's peers (or
management)  may  form  misconceptions  about  the  nature,  feasibility,  quality,  or  quantity  of  the  research. 
Employment  security  of  the  researchers  may  not  be  well  established,  adding  an  element  of  personal  risk  to  the 
venture.  Student  contributors  are  pursuing  degrees  and  expect  to  graduate  in  a  bounded  period  of  time. 

Students  also  have  academic  (and  social)  obligations  which  detract  from  the  amount  of  effort  they  are 
willing and able to commit to research, which makes full-time professional (e.g., post-doctoral) personnel
almost  essential  on  large  scale  experimental  projects.  But  supporting  professional  researchers  may  be  viewed 
by  some  as  contrary  to  the  pedagogical  imperative  of  academia. 

A  lengthy  research  project  must  also  be  able  to  adapt  to  relevant  (supportive  or  not)  results  from  other 
efforts;  assimilation  and  agility  help  prevent  obsolescence. 

And ultimately, there is always the nonzero possibility that maps which claim "Here be dragons" or people
who assert that "You will fall off the edge of the world" may be (at least partly) right - in the quest for new
paradigms, negative results are valuable, frequently more so than positive ones.

We felt that the Archons principals had the requisite experience, insight, self-assurance, physical and
emotional endurance, security, intellectual environment, and physical facilities to accept and conquer the
"insurmountable opportunities" of this challenging research project. (Our feelings in these respects do
continually  vary  through  the  course  of  our  research.) 

2.2.2  Decentralized  Resource  Management 

There  currently  seems  to  be  little  common  understanding  about  what  "distributed"  decision  making  or 
resource management means; this is one of the reasons that the "distributed" and network operating systems
in the literature and laboratories are so very conceptually and functionally disparate. We have chosen to use
the term "decentralization," and to attempt to rather carefully (albeit not formally) define what we mean by it.

"Centralized" and "decentralized" are not usefully viewed as a dichotomy, but rather as the endpoints of a
continuum - indeed, as diagonally opposed vertices of a multidimensional space. We are not under a
misconception that extreme decentralization is necessarily advantageous in all ways and under all circumstances.
However, our experience (both prior to and subsequent to the initiation of the Archons and ArchOS research)
provides cogent arguments and concrete evidence that movement away from the heavily populated highly
centralized subspaces can be invaluable. At least in some applications we are familiar with, such as
supervisory real-time control (e.g., combat platform management and factory automation), more decentralization of
resource management offers improvements in certain system attributes like robustness and modularity. Yet
far too little is known today about the decentralized boundary conditions of this space for us, or any
designers of a particular system, to position ourselves scientifically within it, between maximally
centralized and maximally decentralized.

It became apparent to us that these issues ought to be dealt with explicitly and systematically from the
ground up, not in the prevalent ad hoc adaptive (albeit safer and faster) way, if the many attractive promises
of a physically dispersed computer were to be realized. Thus, we launched an extensive search for the limits of
resource management decentralization, divided into two areas: logical and physical. (Computer scientists
sometimes imagine incorrectly that logical things are innately more conceptually interesting than are physical
things; the opposite seems true to us in this case.)

2.2.2.1 Logically Decentralized Resource Management

One  of  our  first  steps  was  to  create  a  conceptual  model  of  the  space  of  logical  decentralization  of  decision 
making:  the  most  detailed  of  its  incarnations  can  be  found  in  [Jensen  81b].  It  can  be  applied  at  different  levels 
of  abstraction,  from  one  instance  of  one  decision  about  one  resource,  through  all  instances  of  all  decisions 
about all resources. Our model is germane to the management of local or global resources. In it,
decentralization is founded on multilateral management; not, for instance, on the more common theme of resource or
functional partitioning (which leads to autonomy as maximally decentralized, which we reject). Our model
expresses the degree of decentralization as being determined by several factors, crudely summarized as
follows:

•  the  percentage  of  resources  involved; 

•  the  percentage  of  decision  makers  which  participate  (depending  on  the  level  of  abstraction  under 
consideration,  functional  partitioning  and  successive  techniques  such  as  round  robin  may  be 
placed  at  the  centralized  end  of  this  axis); 

• the extent to which all decision makers must become involved before a decision has been
completed (note that resource partitioning and functional specialization are highly centralized by this
metric); 

•  the  degree  of  equality  of  decision  maker  authority  and  responsibility  (this  axis  places  a  premium 
on  peer  relationships,  in  contrast  with  the  ubiquitous  hierarchical  ones). 

To  this  version  we  subsequently  added  a  negotiation  axis,  which  we  summarize  in  print  here  for  the  first 
time. 

At  the  minimum,  more  centralized,  end  of  the  negotiation  axis,  each  decision  is  made  by  a  collection  of 
entities  which  work  as  a  "team"  to  move  the  overall  system  toward  its  goals.  Any  team  member  is  allowed  to 
make  certain  (not  necessarily  fixed)  decisions  without  necessarily  gaining  the  concurrence  of  the  other  team 
members: these decisions may be constituent subdecisions or different instantiations of the same decision
(this  difference  is  represented  on  other  axes  of  the  model).  Any  team  member  may  seek  information  which 
will  improve  the  quality  of  its  decisions  [Marschak  72].  A  well-known  example  of  team  decision  making  is 
routing  in  the  ARPANET  communication  subnet  [Ahuja  82]. 

At  the  opposite,  more  decentralized  end  point,  each  member  in  a  collection  of  decision  makers  develops 
hypotheses  (deductions  and  assumptions)  with  associated  probabilities  -  these  may  be  based  on  some  form  of 
partitioned  competence  (designated  elsewhere  in  the  model)  or  disparity  of  information  (which  may  again  be 
a  logical  factor,  or  a  physical  one  as  discussed  in  the  following  subsection).  To  make  a  decision,  members 
exchange  these,  reason  about  and  modify  them,  making  compromises  as  necessary,  and  in  this  way  enhance 
the  marginal  viewpoints  to  a  more  global  view.  This  activity  must  somehow  converge  to  a  single  consensus 
decision,  perhaps  by  a  formal  method  such  as  that  of  DeGroot  [DeGroot  74],  or  by  heuristics  such  as  in¬ 
ference  rules  and  algorithms  -  the  latter  are  employed  by  some  areas  of  artificial  intelligence  such  as  problem 
solving  and  expert  systems  (e.g.,  the  HearSay  system  [Erman  73],  [Erman  79]).  (Note,  however,  that  HearSay 
was quite centralized by our standards: logically, because the knowledge sources were functionally specialized;
and physically, due to the shared global state "blackboard".) This technique is somewhat similar in spirit
to the "divide and conquer" approach of algorithm design but lacks the optimality of the full Bayesian
method,  because  the  joint  information  has  been  sacrificed.  That  is,  the  various  inter-dependences  are  only 
approximated  by  the  indirect  approach  of  reaching  a  consensus. 
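
The formal route to consensus mentioned above can be illustrated with a small sketch in the spirit of [DeGroot 74]: each member repeatedly replaces its estimate with a weighted average of all members' current estimates, until the estimates agree. This is illustrative only and is not an Archons algorithm. The function name `degroot_consensus`, the use of scalar estimates rather than full probability distributions, and the particular weight matrix are assumptions introduced for the example.

```python
def degroot_consensus(weights, estimates, tol=1e-9, max_rounds=10_000):
    """Iterate x <- W x until all members agree within `tol`.

    weights[i][j] is the (assumed) weight member i gives to member j's
    current estimate; every row must be non-negative and sum to 1.
    Returns (consensus_value, rounds_used).
    """
    n = len(estimates)
    for row in weights:
        if len(row) != n or any(w < 0 for w in row) or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("weights must be a square row-stochastic matrix")

    x = [float(e) for e in estimates]
    for round_no in range(max_rounds + 1):
        # Agreement test: spread between the most extreme estimates.
        if max(x) - min(x) <= tol:
            return sum(x) / n, round_no
        # One negotiation round: every member pools everyone's estimate.
        x = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
    raise RuntimeError("no consensus within max_rounds")


# Three peers with equal authority; each trusts itself somewhat more.
W = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
value, rounds = degroot_consensus(W, [0.2, 0.5, 0.8])
print(f"consensus {value:.6f} after {rounds} rounds")
```

Because this weight matrix is symmetric, the members settle on the plain average of their initial estimates. If no member gives weight to any other (an identity matrix), the estimates never move and no consensus is reached, which the sketch reports as an error.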

We  are  interested  in  the  conditions  under  which  different  degrees  of  logical  decentralization  according  to 
our  model  offer  how  much  of  which  attributes,  and  the  tradeoffs  involved,  in  managing  global  resources.  But 
the  region  of  primary  interest  to  us  in  this  multidimensional  space  of  logical  decentralization  is  where  each 
global  decision  is  made  multilaterally  by  a  group  of  peers  through  negotiation,  compromise,  and  consensus. 

According  to  our  view,  most  resource  management  is  highly  logically  centralized,  even  in  the  myriad 
network  and  distributed  operating  systems  we  are  aware  of. 

2.2.2.2 Physically Decentralized Resource Management

We  have  long  argued  that  the  important  benefits  of  having  system-wide  resource  management  at  the 
operating  system  level,  routinely  provided  by  a  computer,  are  not  available  to  many  systems  -  the  reason  is 
that  those  systems  consist  of  multiple  nodes  which  must  be  physically  dispersed  (for  functionality,  reliability, 
and  logistical  reasons). 

Unfortunately,  operating  systems  as  presently  conceived  are  highly  and  inherently  centralized  in  several 
critical respects. Perhaps most importantly, they are based on some very strong premises about time - e.g.,
that communication delays due to physical dispersal within the operating system are practically negligible
with respect to the rate at which the system state changes (note that the same effect can occur on
VHSIC/VLSIC chips). This leads to the presumption that it is possible (and even cost-effective) for all
processes to share as complete and coherent a view of the entire system state as may be desired (e.g., that a
single global ordering of events can be established). Another class of centralized operating system premises
has to do with the types, frequencies, and effects of faults, errors, and failures. Both the time and fault
premises are rational given the historical evolution of operating systems in the context of shared primary
memory (i.e., uniprocessors and multiprocessors). Unfortunately, many of these premises go unstated (e.g., in
operating system texts and papers), and are either forgotten or assumed to hold always and unquestionably.

Our  focus  is  on  achieving  the  global  executive  level  resource  management  for  a  physically  dispersed  system 
to  be  a  computer  in  the  same  sense  that  a  uniprocessor  or  multiprocessor  is.  However,  we  are  not  restricting 
ourselves to virtual uniprocessors - it is frequently beneficial (e.g., improved fault recovery and performance)
for  some  image  of  the  composite  and  decentralized  structure  of  the  software  or  hardware  to  (occasionally  or 
optionally)  be  made  or  left  visible  to  the  user. 

Presently,  Archons  appears  to  be  essentially  alone  in  stressing  unification  at  the  operating  system  levels;  the 
dominant  theme  in  distributed  system  projects  today  is  "autonomy."  The  only  popular  alternative  to  conven¬ 
tional  centralized  computers  is  computer  networks.  A  conventional  generic  computer  network  can  be  charac¬ 
terized  as  follows: 

•  each  computer  is  (functionally  and  often  administratively)  autonomous  with  its  own  local,  central¬ 
ized  operating  system; 

•  all  the  computers  are  connected  to  a  communications  subnetwork; 

•  each  computer  has  network  server  utility  software  (for  transport  protocols,  naming  conventions, 
and the like) sufficient for them to do resource sharing (e.g., file transfers, mail, virtual terminals);

•  there may be higher layers of software for specific applications (e.g., banking, military C3),
perhaps giving the users some unified perception of the system.

A  network  normally  is  supplied  with  a  so-called  "network  operating  system",  which  tends  to  simply  be  the 
collection of network server utilities. Historically it has been constrained to being a guest of the local
operating systems; recently, more indigenous (and thus more effective) network operating systems are
developing (e.g., [Rashid 81]). A few recent networks and their network operating systems aspire to eventually
make gradual movement in the direction of greater operating system coordination (e.g., [Spice 79]).

Many  applications  need  nothing  more  than  long-haul  resource-sharing  or  value-added  networks,  or  local 
area networks of personal workstations. But not in the cases of concern to us: the very complex problems of
achieving system-wide resource management are forced up to the user level where they are more difficult
(having less access to lower level resources and receiving little assistance and perhaps even resistance from the
local operating systems); and where they must be solved repeatedly if there are multiple users, instead of once
by the system designers. The unsurprising consequence is that these applications with substantive state
change/visibility ratios must suffer because of the solipsistic local operating systems: system robustness is
poor, modularity is compromised, performance (e.g., concurrency) is reduced, and total system cost is
increased.

We  fought  with  the  dilemmas  of  this  dichotomy  between  computers  and  computer  networks  during  the  past 
decade  of  our  experience  designing  distributed  systems  and  realized  that  having  a  physically  dispersed  com¬ 
puter  requires  a  functionally  singular  operating  system  (as  opposed  to  a  network  of  independent  private 
operating systems). The primary obstacles to be overcome are that:

•  communication within the operating system is inaccurate and incomplete with respect to system
state changes;

•  and the types and effects of faults, errors, and failures encountered in multinode physically
dispersed systems differ significantly in both degree and kind from those in single node systems - this
is even more pronounced in a decentralized computer than in a computer network.

3.1.2.1 Accommodating Imperfect Information

High  degrees  of  physical  decentralization  imply  that  resource  management  decisions  routinely  must  be 
"best effort", based on imperfect quantity and quality of information - the virtually ubiquitous "garbage in,
garbage out" characterization of computers is unrealistic and cannot be tolerated in a physically dispersed
multinode computer. This perspective is somewhat familiar above the OS levels (e.g., in certain artificial
intelligence work), and below them (e.g., in dynamic communication packet routing). But at the OS levels (as
in most software), it is a foreign outlook which is incompatible with the current state of the art. Consequently,
new problems have to be solved in the design of the decision algorithms, such as: picking thresholds of result
acceptability, and specifying them to the decision makers; determining what "value" the completeness and
accuracy of information utilized contributes to the "quality" of a decision result.

In  thinking  about  these  issues,  it  becomes  clear  that  while  the  logical  and  physical  aspects  of  decentralized 
decision  making  are  conceptually  distinct,  they  strongly  interact.  For  example,  the  decision  convergence  time 
may  include  acquiring  suitably  valuable  quantity  and  quality  of  information,  as  well  as  negotiating. 

An unavoidable characteristic of a physically dispersed machine is a significant increase in the
indeterminism of its behavior. The centralized mind set is that not only ends but also means at all levels ought to be
entirely deterministic: it is normally affordable to closely approximate this in a centralized machine, and
instances to the contrary are dealt with in various ad hoc fashions. Our position is that considerable
indeterminism is the normal case in decentralized resource management, and can be exploited to advantage (e.g.,
improved robustness and performance) rather than merely tolerated. Dynamic packet routing demonstrates
our position, and we have done so (transparently to the users) in a network operating system ([Sha 83]).

3.1.2.2 Faults, Errors, and Failures

In  a  physically  dispersed  multinode  system  (whether  network  or  computer),  reliability  problems  are  worse 
than  in  nondispersed  and  uninode  systems,  particularly  when  considered  in  light  of  the  imperfect  information 
issues  discussed  above.  For  example,  concurrency  control  and  failure  atomicity  become  vastly  more  compli¬ 
cated,  far  beyond  the  realm  of  current  centralized  operating  system  conceptions.  Computer  networks  have 
more  relevant  technology  in  this  respect,  but  most  of  it  is  actually  inspirational  rather  than  directly  trans¬ 
ferable  to  a  decentralized  OS  and  computer. 

We  address  failure  management  and  recovery  within  an  operating  system  by  thinking  of  the  OS  state  as  a 
special  kind  of  distributed  database  which  is  approximately  replicated  at  each  node.  This  suggests  that  an 
atomic transaction facility for use by the OS (and perhaps by higher, e.g., application, levels) be incorporated
in each instance of its kernel ([Jensen 81c], although we made this pivotal design decision in 1978 as a result
of several enlightening discussions with Gerard Le Lann about his distributed database concurrency control
research).  As  a  consequence,  three  classes  of  significant  research  issues  arise. 

One  is  that  both  the  services  and  structure  of  the  OS  ought  to  be  substantively  affected  by  the  availability  of 
atomic  transactions  as  kernel  primitives.  This  is  essentially  virgin  territory:  the  few  approximations  to  atomic 
transactions  at  the  OS  level  have  been  ad  hoc;  in  fact,  they  have  not  been  explicitly  viewed,  designed,  and 
exploited  as  transactions. 

Secondly,  the  overhead  (especially  communication)  of  atomic  transactions,  which  is  always  a  concern, 
becomes  of  paramount  importance  at  the  OS  kernel  level.  We  have  determined  that  one  can  achieve  great 
acceleration  without  degrading  flexibility  with  thoughtfully  crafted  hardware  mechanisms  at  the  disposal  of 
software  modules  which  establish  the  desired  policies. 

The  third  class  of  research  issues  has  to  do  with  the  need  for  insightful  reconsideration  of  atomicity  itself. 
While  the  inclusion  of  atomic  transactions  in  our  OS  was  inspired  by  their  contributions  to  conventional 
database  systems,  our  transaction  facility  differs  radically  in  several  respects  from  those  used  in  that  context. 
Examples include the following.

The conventional serializability theory of concurrency control includes the assumption that consistency and
correctness result from a single transaction executing alone; this leads directly to the same result for all
serializable schedules. An advantage of serializability is that it is completely general and works without
requiring knowledge about either the database or the transactions; but a disadvantage is that it cannot exploit
such knowledge which may be available in any specific case (such as in an OS). The cost of this generality is
that serializability can exclude consistent and correct schedules which provide higher concurrency than those
it permits. Database researchers have begun to study nonserializable consistency control methods in search of
greater concurrency, but a major difference is that a transaction can no longer be regarded as if it were
executing alone. Consequently, the consistency and correctness properties of nonserializable scheduling rules
do not follow automatically; they must be proven. So that every attempt to utilize nonserializability is not
burdened with inventing its own rules and proving their properties, a formal theory of nonserializable concurrency
control is needed. Because it is already known that serializability theory provides the highest degree of
concurrency possible when using only the classically defined transaction syntax, some additional kind of
information is required if nonserializability is to perform better. Researchers other than ourselves seem to be
focused  exclusively  on  exploitation  of  transaction  semantics,  developing  syntactic  structures  to  support 
programmers'  specification  of  their  own  application-dependent  scheduling  rules.  All  programmers  involved 
with  any  given  database  must  understand  the  details  of  each  other’s  transactions,  and  every  programmer  is 
responsible  for  the  consistency  and  correctness  properties  of  his  own  rules.  Such  extensive  use  of  global 
transaction semantics (e.g., the "break point specifications" in [Lynch 83] and "lock compatibility tables" in
[Schwarz  82])  allows  very  high  degrees  of  concurrency,  but  appears  to  limit  this  approach  to  rather  static  and 
specific  situations.  Modularity  is  recognized  to  be  extremely  valuable  in  software  engineering  generally;  we 
consider  it  a  critical  attribute  in  transaction-based  distributed  computations,  particularly  decentralized  operat¬ 
ing  systems.  Therefore,  we  have  created  a  different  and  more  decentralized  theory  of  nonserializable  concur¬ 
rency  control  which  seems  improved  with  respect  to  modularity;  when  a  programmer  schedules  one  of  his 
transactions,  he  need  know  only  the  agreed  upon  transaction  syntax,  the  details  of  his  own  transaction,  and 
the  consistency  constraints  of  the  database  subset  affected  by  this  transaction.  We  define  a  new  transaction 
syntax,  called  compound  transactions  (of  which  nested  transactions  are  a  special  case),  and  its  associated 
generalized setwise serializable scheduling rules (of which serializability is a special case). Our schedules are
complete  in  the  sense  that  for  any  consistency  and  correctness  preserving  schedule,  there  exists  an  equally 
consistent  and  correct  setwise  serializable  schedule  which  provides  at  least  as  much  concurrency. 
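
The contrast between ordinary serializability and a setwise variant can be made concrete with a small sketch. It is not the theory of [Sha 84], which also covers compound transactions and failure safety. It assumes, purely for illustration, that the database is partitioned into data sets with no consistency constraint spanning two sets, and that a schedule is acceptable when its projection onto each data set is conflict-serializable. The schedule encoding and the names `conflict_serializable`, `setwise_serializable`, and `data_set_of` are hypothetical.

```python
from collections import defaultdict


def conflict_serializable(schedule):
    """Classical test: the precedence graph of the schedule is acyclic.

    `schedule` is a list of (transaction, op, item) steps in execution
    order, with op being "r" (read) or "w" (write).
    """
    edges = defaultdict(set)
    txns = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        txns.add(ti)
        for tj, op_j, item_j in schedule[i + 1:]:
            # Two steps conflict if they touch the same item, belong to
            # different transactions, and at least one of them writes.
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges[ti].add(tj)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in txns}

    def has_cycle(t):
        color[t] = GREY
        for u in edges[t]:
            if color[u] == GREY or (color[u] == WHITE and has_cycle(u)):
                return True
        color[t] = BLACK
        return False

    return not any(color[t] == WHITE and has_cycle(t) for t in txns)


def setwise_serializable(schedule, data_set_of):
    """Assumed reading of 'setwise': serializable within each data set."""
    projections = defaultdict(list)
    for step in schedule:
        projections[data_set_of[step[2]]].append(step)
    return all(conflict_serializable(p) for p in projections.values())


# T1 precedes T2 on x, but T2 precedes T1 on y.
schedule = [("T1", "w", "x"), ("T2", "w", "x"),
            ("T2", "w", "y"), ("T1", "w", "y")]
print(conflict_serializable(schedule))
print(setwise_serializable(schedule, {"x": "A", "y": "B"}))
print(setwise_serializable(schedule, {"x": "A", "y": "A"}))
```

In this example the interleaving is rejected by the ordinary test but accepted once x and y are declared to lie in separate data sets, which is the kind of additional concurrency described above. When both items fall in one data set, the setwise test reduces to the ordinary one.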

An  important  implication  of  our  different  approach  to  transactions  is  that  failure  management  and  recovery 
must be re-evaluated. The usual notion of failure atomicity is drawn from serializability theory, where a
transaction cannot be committed until all its actions are successful and in stable storage. Higher concurrency
can be achieved by determining conditions which permit a transaction to commit completed steps before the
end of the transaction. Our concept of failure safety is based on such conditions. In [Sha 84], we formalize our
theory, and its properties of consistency, correctness, modularity, optimality, completeness, and failure safety.


Other  major  departures  we  must  make  from  normal  atomic  transaction  facilities  include: 

•  Instead of being above an OS with the corresponding functionality to draw on, our transaction
facility  is  beneath,  inside  the  OS  kernel.  This  affects  the  facility  design  substantially. 

•  Rather than handling simple database objects such as records and files, it must accommodate the far
more  complex,  abstract,  and  dynamic  data  types  found  in  an  OS. 

•  An  object  is  not  necessarily  located  at  a  single  node  -  a  single  instance  of  it  may  be  physically 
dispersed across multiple nodes.


Note  that  the  work  outlined  above  has  applicability  beyond  our  motivation  to  enable  the  creation  of  a 
logically  and  physically  decentralized  operating  system  (and  computer)  which  is  extremely  reliable  and 
modular. 


2.2.3  Other  Objectives 

In this subsection, we cover some salient objectives of distributed system/OS projects which are, and are not,
factors  in  the  Archons  and  ArchOS  effort  —  this  will  help  familiarize  the  reader  with  our  research  and 
distinguish  it  from  the  other  work  in  this  general  field. 

2.2.3.1 Research Per Se Versus Facility Development
The  Archons  system  and  its  ArchOS  operating  system  are  vehicles  for  our  own  research  in  decentralized 
resource  management  -  this  has  three  major  ramifications: 

•  First, and most important, is that projects (e.g., Spice here in the Computer Science Department,
and many others in progress elsewhere) which are intended to result in a general computational
facility necessarily have shorter term schedules, and thus scope and risk constraints far more
conservative than ours.

•  Second,  we  have  no  desire  to  be  compatible  with  anything;  in  particular,  ArchOS  is  not  compelled 
to  present  its  users  with  a  UNIX  interface. 

•  Finally,  the  Archons  and  ArchOS  hardware  and  software  are  privately  owned  and  operated  by  the 
Archons  project  rather  than  by  the  Computer  Science  Department's  research  facilities  group 
(although  we  are  very  grateful  for  their  kind  cooperation  and  assistance).  While  we  consequently 
lose the valuable committed support of that group, we also are not subject to Department
logistical  policies  regarding  permissible  hardware  and  software. 


ArchOS is an experimental prototype which will serve, among other things, as an existence proof that our
new resource management paradigms are valid and that it is possible to base an OS on them - if obliged to, we
will treat cost-effectiveness and even feasibility as almost second-order effects.


We expect to include only a subset of the services usually associated with an OS: those for which we have
designs or implementations that meet our research objectives (e.g., process to node binding); or those which
we need ourselves (regardless of how centralized or decentralized we do them). Everything else will be
considered dispensable.

The  design  and  implementation  of  ArchOS  will  be  in  a  constant  state  of  flux,  and  will  not  present  stable 
facilities to its users. Some of our research sponsors will possess copies of ArchOS, but we are obviously
unwilling  and  unable  to  support  them  in  the  field;  at  least  two  of  these  sponsors  expect  to  maintain  some 
version  of  ArchOS  themselves,  and  IBM  has  expressed  a  willingness  to  consider  providing  ArchOS  support 
for  the  others. 

2.2.3.2 Large Scale Experimental Computer System Research

An  unfortunate  limitation  of  most  U.S.  university  computer  science  (CS)  departments  is  their  inability  to 
conduct  large  scale  experimental  research.  NSF  and  other  national  (and  even  a  few  state)  government 
agencies,  together  with  some  industrial  corporations,  are  trying  to  help  remedy  this,  but  still  very  few  CS 
departments have the requisite facilities and conducive environment. This is particularly true for computer, as
contrasted with software, systems; even most EE departments suffer in this respect. Given the facilities and
environment,  there  is  still  the  choice  of  research  style  to  be  made  (see  Section  22.1).  For  want  of  either 
opportunity  or  desire,  most  computer  scientists  in  the  computer  systems  area  do  not  design,  implement,  and 
experiment  with,  large  scale  systems.  Numerous  distributed  systems  of  various  types  are  being  constructed, 
but  many  are  small  in  one  or  more  respects,  and  virtually  all  are  entirely  software  efforts  utilizing  existing 
commercial computers (typically ranging from LSI-11's to VAX's) and interconnection hardware (usually an
Ethernet).  It  is  interesting  that  except  for  certain  military  systems,  this  characterization  holds  for  industrial  as 
well  as  academic  distributed  systems. 

One  of  the  objectives  of  Archons  which  most  differentiates  it  from  other  distributed  system  research  is  our 
willingness,  desire,  and  indeed  determination,  to  reflect  our  unconventional  OS  in  the  architecture  of  the 
hardware  (both  processors  and  interconnection)  it  runs  on. 

Initially, we are employing the CS Department VAX's plus our own interim testbed for algorithm experiments,
simulations, and software development. A project facility was required in addition to the department
one  because: 

•  some of our concept and algorithm experiments would be distorted by system sharing (e.g.,
Ethernet  traffic); 

•  other  experiments  would  be  interfered  with  by  the  VAXs’  UNIX  operating  systems  -  we  need  the 
freedom to substitute a simple executive;

•  ArchOS is native (i.e., executes on the bare hardware).

There are two reasons for beginning with interim hardware: the requirements for the Archons decentralized
computer hardware are primarily generated by ArchOS, which has not yet progressed far enough for its
underlying needs to be clear; and even if we knew now exactly what those needs were, the hardware effort is
itself very large scale and will require considerable time to complete.

Our  interim  testbed  is  deliberately  conventional  network  technology.  The  selection  requirements  for  it 
were: off the shelf computing and connection hardware to ensure immediate availability; the use of a Multibus
backplane, so that we could do our own system integration and modifications as desired (e.g., multiprocessor
nodes, special support hardware boards); Berkeley UNIX, for compatibility with the CS Department
VAX's, and for its networking and other enhancements; the processor being a 68010, for software
development  tools  and  other  software  availability.  These  led  uniquely  to  the  Sun  Microsystems,  Inc.  products. 

Our  eventual  Archons  decentralized  computer  will  be  highly  unconventional  in  essentially  every  respect. 
For  example,  each  node  consists  of  an  application  subsystem  and  a  resource  management  subsystem  -  the 
user  programs  execute  in  the  application  subsystems  (which  may  be  heterogeneous  and  whatever  the  applica¬ 
tion  calls  for);  ArchOS  executes  in  the  resource  management  subsystem  which  is  based  on  a  very  unusual 
machine  of  our  own  design  (named  Meta),  optimized  for  decentralized  resource  management  and  the  cor¬ 
responding  attributes  we  seek. 

2.2.3.3 Application and Attributes

The application environment of principal interest to Archons and ArchOS, at least initially, is "upscale"
supervisory  real-time  control  -  e.g.,  military  combat  platform  management,  large  scale  industrial  factory 
automation.  It  is  rare  for  an  academic  project  to  focus  on  real-time  applications:  faculty  and  students  have 
little exposure to, and thus understanding of, this class of problems; and there is sometimes a sociological
tendency  to  avoid  working  directly  on  projects  of  military  significance  (even  though  nearly  100%  of  academic 
computer  science  and  engineering  research  is  funded  by  the  DoD,  and  is  useable  by  the  DoD  regardless  of 
who  funded  it).  The  real-time  control  environment  offers  both  simplifications  and  complications.  The  former 
is  that  such  systems  are  typically  dedicated  function,  implying  that  at  least  some  of  the  resources  can  be 
managed  in  a  more  static  style  than  possible  in  general  purpose  systems;  at  any  particular  state  of  the  OS  art, 
this  may  make  the  difference  between  being  able  to  perform  a  function  in  a  highly  decentralized  manner  and 
not  (explaining  why  so  many  of  the  most  interesting  distributed  systems  have  been,  and  continue  to  be,  found 
in  the  military  real-time  control  field,  albeit  hidden  from  public  sight).  The  latter  arises  from  the  essence  of 
real-time control, that resource management must be time driven - so mere existence of the functionality
doesn't suffice, it must meet deadlines as well. We find this combination of opportunity and challenge ideal
and in fact irresistible when combined with the eagerness of military customers to not just fund but
experimentally apply innovative systems such as we want to create.

While it may seem contradictory to our selection of real-time control as an application environment, performance
(especially in the throughput sense) is not one of the more important properties we seek to attain with
highly decentralized operating systems and computers. The contradiction is an illusion, because we are
designing the response time driven precept as a fundamental characteristic into our resource management
principles, algorithms, OS, and system. The actual magnitudes of the deadlines any particular design or
implementation can handle are of less significance to us, and will be the subject of subsequent performance
optimization work.

Moreover,  we  perceive  that  performance  of  most  systems  will  improve  automatically  (and  rapidly)  with 
advances in semiconductor technology. But we can expect little if any assistance from semiconductor technology
in areas of equal or greater importance, such as fault tolerance and modularity - these are what
computer systems research ought to attempt to improve. The common bias toward performance without
acknowledging what is being traded for it (e.g., the reduced instruction set computer controversy) is not
because performance is so much more important than other system attributes, but in our opinion because it is
so much easier to attain and measure.

2.3  ArchOS  Development  Plan 

This  subsection  discusses  some  of  the  methods  we  are  employing  and  documents  we  are  producing  in  the 
course  of  developing  the  initial  version  of  ArchOS.  The  overall  progression  is  illustrated  in  Figure  2-1.  While 
this  methodology  isn’t  as  elaborate  as  good  industry  practice,  it  appears  to  be  far  more  extensive  than  that 
conducted  in  other  academic  distributed  system  projects.  This  reflects  not  only  the  industrial  background  of 
several  of  the  Archons  principals,  but  also  the  fact  that  the  scope  and  complexity  of  the  ArchOS  research 
demand careful program management. We believe that our software specification techniques make some
novel  technological  contributions  of  their  own. 

2.3.1  Research  Requirements 

An  OS  development  effort  is  usually  launched  with  the  presentation  of  a  Requirements  Specification 
document that emphasizes the performance and resource utilization aspects of the OS (e.g., response times
and storage restrictions), some of the key internal structural requirements (e.g., a tree structured file directory
system),  and  certain  characteristics  of  the  users’  interface.  The  development  then  proceeds  to  optimally  meet 
these  requirements  by  any  means  subject  to  various  program  management  constraints. 


Archons Researchers & Sponsors

RR: Research Requirements
CI: Clients Interface Specification
SFS: System Functionality Specification
SAS: System Architectural Specification
SDS: System Design Specification
CDS: Component Design Specification

Figure 2-1: ArchOS Development Steps


We felt that ArchOS needed such a directing document, but one with a completely different emphasis. It
must  confine  the  development  of  ArchOS  so  as  to  assure  that: 

•  ArchOS  will  satisfy  the  research  objectives  of  the  Archons  researchers  and  sponsors.  These 
involve  understanding  the  characteristics  and  costs  of  decentralized  (as  we  have  defined  it)  operat¬ 
ing  system  resource  management.  They  do  not  involve  client  convenience  or  performance  op¬ 
timization. 

•  ArchOS will have no centralized implementations of the OS functions (at any level)

o  because they are familiar, or

o  because they are "obviously the best" (usually in a performance sense), or

o  because they are allowed to creep in unintentionally.

•  ArchOS will neither build upon nor offer any mechanisms that are based on unfounded assumptions
carried over from our experiences with centralized systems.

•  ArchOS will provide complete internal observability to the experimenters (but not to the application
level clients). As a vehicle for experimental research, ArchOS must readily reveal the kinds of
data that make experiments meaningful.

•  ArchOS  will  support  change  of  both  facilities  and  implementations.  So  little  is  known  about  the 
nature  of  decentralized  resource  management  that  we  must  anticipate  the  need  to  implement  and 
evaluate  alternative  approaches. 


Our  Research  Requirements  document  captures  these  notions,  enumerates  specific  sponsor  requirements, 
and  supplies  check  lists  to  be  applied  during  the  review  of  each  subsequent  work  product.  We  are  not  calling 
this  document  a  specification  because  we  feel  that  it  will  probably  be  impossible  to  measure  the  degree  of 
compliance  to  many  of  the  items  that  it  contains. 


2.3.2 Clients' Interface Specification

We are using the Research Requirements to establish the clients' view of ArchOS, which is being recorded
in the Clients' Interface Specification document. This document defines the entire external interface available
to a client process, and specifies ArchOS's behavior as observed at that interface. Because ArchOS is expected
to manage all the system (global) resources, this is also the clients' view of the decentralized computer system.

Having  the  clients’  interface  possess  features  that  would  make  it  convenient  to  be  used  interactively,  by  a 
person,  is  a  low  priority  concern  to  us  at  this  time.  Our  driving  concern  is  for  the  interface  to  be  rich  enough 
to  allow  the  needs  of  the  clients  to  be  completely  conveyed  to  ArchOS.  It  is  meaningless  to  use  a  phrase  like 
"best-effort",  if  it  is  based  on  pre-determined  notions  of  the  ArchOS  designers  instead  of  on  information  that 
can  only  come  from  the  application  using  the  system. 


We believe that an OS that forces the client processes of a system to know the details of the system's
resources  places  all  responsibility  for  reliability  and  availability  on  the  application.  That  is,  if  the  clients  have 
to  base  their  OS  requests  on  the  initial  configuration  of  hardware,  then  they  similarly  must  base  these  requests 
on  the  current  state  of  the  system  when  it  is  running  in  a  degraded  mode.  Even  worse,  it  would  force  them  to 
account  for  the  fact  that  the  "current  state  of  the  system"  could  look  different  to  each  user  process,  because  of 
unknown  and  variable  communication  delays.  This  illustrates  why  we  must  treat  phrases  like  "current  state  of 
the  system"  as  being  (at  best)  probabilistically  defined. 

The  clients'  view  must  not  depend  upon  the  structure  of  ArchOS,  the  mechanisms  that  implement  ArchOS, 
or  the  internal  strategies  used  by  ArchOS.  Allowing  such  dependencies  would  necessarily  corrupt  the  validity 
of  data  collected  in  comparative  experiments.  Consider  comparing  mechanisms  X  and  Y  within  ArchOS.  We 
would  have  a  set  of  "application"  processes  (some  driver  programs  to  exercise  ArchOS)  that  would  have  to  be 
changed,  if  their  interface  with  ArchOS  depended  on  the  choice  of  X  or  Y.  Ignoring  the  undesirability  of 
having  to  develop  two  sets  of  driver  programs,  we  would  still  be  faced  with  the  problem  of  showing  (or  even 
believing)  that  both  programs  represent  comparable  stress  on  the  system. 

For all these reasons, we will use only the Research Requirements to define the clients' view of ArchOS.
This (i.e., not using a vocabulary that reflects the inner structure of the OS) will be a novel approach in the
specification of OS services. We think such an approach may have value even for a uniprocessor OS, in
a system where all the users are cooperating to meet a common goal. We will define an interface where the
client expresses his needs, but not how ArchOS is expected to meet them. We envision requests that supply
information like: I need a place to store information, and the value to me of:

•  acquiring this place in a time, Tr, is V1(Tr),

•  acquiring a place to store amount, S, of information is V2(S),

•  acquiring a place that has an average access time, Ta, is V3(Ta),

•  and so forth, for things like expected survivability over time, protection from (or accessibility by)
others, behavior of the storage place in the event that I (the requestor) crash, behavior of the
system in the event that the information is lost (e.g., notify me, kill me).

Note that such a storage request could be satisfied with classical GetMain, or AllocateFile, or GetMains
(primary and backup, in another failure domain), or GetMain with AllocateFile (backup), or AllocateFiles
(primary and backup, in another failure domain).

It is likely that each of the value functions will be accompanied by minimum acceptable and maximum
useful values. ArchOS would be expected to satisfy the request if it can achieve the minimum value for each
function. To the extent that the request can be satisfied in several ways, ArchOS would be expected to
maximize the total value achieved without exceeding any function's maximum useful value. One possibility,
if the request can't be satisfied, would be to maximize the total value to the system. This might require that
resources be taken from another client and used to satisfy this request (note that we assume all clients are
cooperating to achieve a single goal, and the value functions they supply should reflect this).
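
The selection rule just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the ArchOS interface: the names ValueFunction, total_value, and choose_offer are hypothetical, and "not exceeding the maximum useful value" is interpreted here as capping each function's contribution at that maximum.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ValueFunction:
    """Hypothetical client-supplied value function with its two bounds."""
    fn: Callable[[float], float]  # maps an attribute level (e.g. access time) to a value
    min_acceptable: float         # the request is unsatisfied below this value
    max_useful: float             # value above this brings the client no extra benefit


# One candidate way of satisfying a request: attribute name -> attribute level.
Offer = Dict[str, float]


def total_value(request: Dict[str, ValueFunction], offer: Offer) -> Optional[float]:
    """Capped total value of an offer, or None if any minimum acceptable value is missed."""
    total = 0.0
    for name, vf in request.items():
        achieved = vf.fn(offer[name])
        if achieved < vf.min_acceptable:
            return None
        total += min(achieved, vf.max_useful)  # assumption: cap at the maximum useful value
    return total


def choose_offer(request: Dict[str, ValueFunction], offers: List[Offer]) -> Optional[Offer]:
    """Among offers meeting every minimum, pick the one with the largest capped total."""
    scored = [(total_value(request, offer), offer) for offer in offers]
    acceptable = [(value, offer) for value, offer in scored if value is not None]
    if not acceptable:
        return None  # the request cannot be satisfied as stated
    return max(acceptable, key=lambda pair: pair[0])[1]
```

A None result corresponds to the case in which the request cannot be satisfied and some system-wide policy has to take over.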

We  expect  to  be  able  to  generalize  such  a  scheme  so  that  it  could  mimic  any  strategy,  of  which  we  are  aware, 
for handling situations where there are insufficient resources to satisfy all the clients' needs. We are unsure of
the  degree  to  which  the  generalization  would  make  the  scheme  hazardous.  One  can  easily  envision  limit 
cycles,  in  a  heavily  loaded  situation,  where  resources  are  constantly  being  moved  around,  and  very  little  use  is 
actually  being  made  of  them.  It  may  be  necessary  to  introduce  hysteresis  by  taking  resources  from  another 
client  to  supply  a  new  request  only  if  the  total  system  value  of  all  resources  increases  by  more  than  a  certain 
amount. 
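
The effect of such hysteresis can be illustrated with a toy simulation. This is a sketch under simplifying assumptions (a single resource, two clients, and externally supplied system values per time step); the function name count_moves and its parameters are hypothetical.

```python
from typing import Sequence


def count_moves(values_a: Sequence[float], values_b: Sequence[float], threshold: float) -> int:
    """Count how often one resource changes hands between clients A and B.

    values_a[t] and values_b[t] are the total system values obtained at step t
    if the resource is held by A or by B respectively. The resource is moved
    only when the gain in system value exceeds `threshold` (the hysteresis).
    """
    holder = "A"
    moves = 0
    for va, vb in zip(values_a, values_b):
        held, other = (va, vb) if holder == "A" else (vb, va)
        if other - held > threshold:
            holder = "B" if holder == "A" else "A"
            moves += 1
    return moves
```

With values that alternate slightly in favor of one client and then the other, a threshold of zero moves the resource at every step, while a threshold larger than the fluctuation leaves it in place.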

To contain the risk inherent in this novel approach, we will allow (as a last resort) the subsequent development
steps to restrict the clients from exercising the full generality of the interface.

2.3.3  System  Functionality  Specification 

The System Functionality Specification document will explicitly insist on adherence to the Archons
project's guiding concepts. It will formally define certain terms (e.g., atomicity, negotiation, compromise,
consensus, "guarantees") that characterize these concepts. It will also specify that certain facilities (e.g.,
transaction mechanisms, deadlock avoidance/detection mechanisms, recovery mechanisms), representing
instances of these concepts, shall exist in ArchOS.

The  System  Functionality  Specification  will  be  derived  from  the  Research  Requirements,  our  previous 
research,  and  the  assumption  that  an  incarnation  of  ArchOS  resides  at  each  node  in  a  physically  dispersed 
system. 

2.3.4  System  Architectural  Specification 

The System Architectural Specification documents a unit of work that takes the Research Requirements,
the Clients' Interface Specification and the System Functionality Specification and produces a design of
ArchOS, in the form of layered subsystems (e.g., IPC, OS File System, Client File System, OS Resource
Allocation, Client Resource Allocation, OS Transaction Server, Client Transaction Server, Timer Services),
that satisfy these specifications.

A "uses" hierarchy [Parnas 74] of the subsystems (e.g., IPC "uses" OS Resource Management, and Client
Resource Allocation "uses" IPC) will be produced and justified. Note that a subsystem, UsingSubsystem, that
"uses" another subsystem, UsedSubsystem, can (classically) never make a stronger performance guarantee
(with respect to the service supplied by UsedSubsystem) than that which is made by UsedSubsystem. We
intend to examine ways to specify "suspicious use" of another subsystem, UsedSubsystem, when we mean that
UsedSubsystem may be "used", but has probabilistic behavior. In such a way, it may be possible to have
UsingSubsystem promise more than UsedSubsystem does (e.g., by repeated use of UsedSubsystem, or confirmation
of UsedSubsystem through other means).
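
As a rough illustration of how repeated use could let a using subsystem promise more than the used one, consider the following sketch. It assumes that each use succeeds independently with the same probability and that a failed use is detectable, neither of which is claimed by the text; the function names are hypothetical.

```python
def success_probability(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent uses succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest number of independent uses whose combined success probability reaches `target`."""
    if not (0.0 < p_single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    attempts = 1
    while success_probability(p_single, attempts) < target:
        attempts += 1
    return attempts
```

Under these assumptions, a used subsystem that succeeds half of the time supports a using subsystem that succeeds at least nine times out of ten, at the cost of up to four uses per request.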

It is also important to realize that (classically) the proper behavior of UsedSubsystem is a precondition for
the specified behavior of UsingSubsystem. That is, if UsingSubsystem can "use" UsedSubsystem, then
UsingSubsystem can exhibit any behavior when UsedSubsystem does not perform to specification. While it is
certainly true that UsedSubsystem can fail in undetectable ways that can only mean that UsingSubsystem must
fail, it is also true that UsedSubsystem can fail in ways that are detectable. Because we are concerned with
ultra-reliable systems, we will try to maximize the detectability of the failure of "used" subsystems, and
require (where possible) corrective action by the detector (i.e., "user"). Similarly, a precondition of every
transaction is the consistency of all the shared data objects that it accesses. A classical transaction can behave
in any fashion when this precondition is not met. We would like our transactions to take positive steps, when
possible, toward making the data consistent when inconsistencies are detected.

Each  subsystem  will  be  defined  and  will  have  its  behavior  specified.  Particular  concepts  and  facilities  from 
the  Functionality  Specification  will  be  associated  with  appropriate  subsystems. 

2.3.5  System  Design  Specification 

The  System  Design  Specification  document  defines  and  specifies  the  ArchOS  components.  A  component 
represents  the  intersection  of  an  ArchOS  subsystem  and  a  hardware  node.  This  is  the  unit  of  work  that 
establishes  the  decentralized  nature  of  ArchOS.  This  work  will  be  based  on  the  Research  Requirements,  the 
System  Functionality  Specification,  and  the  System  Architectural  Specification. 

It is (at least initially) our intention to have identical node components for any given subsystem. The
specification of a component must include the (symmetrical) interfaces with its peers at other nodes. It is
through the protocols with its peers that the union of components of a subsystem will supply the services of
the subsystem (as required by the System Architectural Specification) using decentralized decision making
(consensus, negotiation, and compromise). We will use our interim testbed facility to evaluate and
demonstrate the specific decentralized resource management algorithms considered for each ArchOS subsystem.

To minimize initial complexity, we will assume that all interfaces between subsystems occur between the
respective components of those subsystems at the same node. That is, all protocols are peer level. After we
develop the basic algorithms for a subsystem, and as they are subsequently being refined, we will search for
optimizations that may be obtained (and analyze the information hiding that may be lost) by allowing other
than peer level protocols. For example, if the OS Resource Management subsystem component handles a
request for disk file space, originating at its node, by asking all its other peer components "how well can you
satisfy this request" and analyzing the responses, then it may be possible to have the subsystem that made the
request broadcast it to all of these components in the first place.

Wherever  possible,  the  SDS  will  not  make  any  assumptions  about  the  hardware  structure  of  a  node.  In  the 
event  that  it  is  impractical  to  specify  a  component  without  considering  the  underlying  hardware,  we  will  allow 
the  design  to  use  knowledge  of  the  interim  testbed  hardware.  This  point  must  always  be  deferred  as  long  as 
possible  and  the  work  that  is  based  on  this  knowledge  must  be  clearly  defined.  In  this  way,  a  well  defined  and 
minimized  amount  of  design  must  be  redone  when  we  move  to  other  hardware. 

In a similar fashion, only the communications subsystem will be allowed to have knowledge of the details of
the interconnection network(s).

2.3.6  Component  Design  Specification 

The Component Design Specification document defines and specifies the modules that make up each
ArchOS component. This work will be based on the Research Requirements, the System Functionality
Specification, and the System Design Specification. A module represents an ArchOS unit that can be implemented
by a single programmer, who is unversed in the Archons project and its goals.

2.3.7  Implementation 

The implementation of ArchOS will consist of designing and programming the modules, integrating the
modules into components, testing the single node behavior of the component, and (when the communications
subsystem components are working) testing the multi-node (i.e., full subsystem) behavior of the component.
When all the subsystems have been implemented we will proceed to experiment with ArchOS.

We  plan  to  implement  ArchOS  from  the  bottom  up  (even  though  we  will  design  it  top  down),  so  that 
testing  a  component  will  require  test  driver  programs  only  to  exercise  its  higher  level  interfaces. 

The  module  programmers  will  work  from  (the  appropriate  portions  of)  the  ArchOS  Component  Specifics- 


2.4  Conclusion 

We have arguments and evidence that physically and logically decentralized resource management, as we
have defined it, offers significant potential benefits over conventional approaches in some applications,
particularly real-time supervisory control - e.g., embedded computers for combat platform management and
industrial automation.

We  are  performing  conceptual,  theoretical,  and  experimental  work  to  discover  the  types  of  benefits  which 
can  be  achieved,  the  conditions  under  which  they  can  and  cannot  be  achieved,  and  the  costs  of  achieving 


The  Archons  project  has  been  progressing  for  approximately  three  years,  creating  and  developing  the 
concepts  of  decentralized  resource  management,  performing  theoretical  analysis,  planning  the  structures  of 
the  operating  system  and  eventual  hardware,  implementing  the  interim  testbed,  etc. 

The specification, design, and implementation of ArchOS began this year, and we estimate that it will require
three years for completion of the first experimental prototype. The manpower committed to this ArchOS
portion of the Archons project currently consists of five full-time professional (e.g., post-doctoral)
researcher positions (two of these slots as yet being unfilled), four full-time Ph.D. students, and part-time
participation by several other Archons personnel (two faculty, a program manager, and three Ph.D. students).
Additional staffing will be added as necessary - for example, programmers when full scale implementation
begins.

Further conceptual, formal, and experimental research on the principles, design, and implementation
associated with the entire Archons project is continuing concurrently with the ArchOS effort (involving about
eight full-time equivalent researchers).

3. Transactions for Operating Systems


3.1  Overview 

The  use  of  atomic  transactions  in  the  kernel  of  an  operating  system  is  one  of  the  fundamentally  important 
aspects  of  our  research.  While  transactions  are  well  understood  at  the  higher  level  of  database  management, 
most of that knowledge is inapplicable at the lower operating system kernel level. While we believe we were
among the first to recognize the value of transactions in operating systems, others have subsequently
begun to explore this also. But again our efforts have been pioneering in their focus initially on a formal basis
for data consistency and transaction correctness, rather than taking an application-dependent ad hoc approach.
Other contributions of the Archons transaction principles are improved modularity and fault
tolerance. We base our research on what we term a "relational data model" and "set-wise serializable" atomic
transactions.

3.2  Relational  Data  Model 

3.2.1  Introduction 

In distributed systems, multiple entities (at any particular level of abstraction) perform tasks by co-operating
in various ways so as to improve concurrency, reliability, and modularity, as well as to accommodate
physical  dispersal.  Co-operation  implies  some  form  of  synchronization  among  processes  or  synchronization 
of  concurrent  access  to  shared  data  objects.  The  former  type  of  co-operation  has  been  pursued  primarily  in 
centralized uniprocessor and multiprocessor computers, while most distributed systems are computer networks
and thus focus on the latter type. Furthermore, computer networks (and centralized computers to a
lesser extent) typically exhibit a form of co-operation exemplified by autonomous client and server functions.
Instead, the Archons project is performing research on the science and engineering necessary for a decentralized
computer - a new hybrid which is a single computer in the sense of a multiprocessor, but is physically
dispersed  much  like  a  local  network.  The  appropriate  paradigm  of  co-operation  in  such  a  machine  seems  to 
be  peer  relationships  in  which  a  (variable)  number  of  equal  partners  collaborate  on  a  function  (e.g.,  to  jointly 
fill  a  single  role).  We  are  particularly  interested  in  styles  of  co-operation  where  a  team  of  equals  negotiate, 
compromise,  and  reach  a  consensus  to  manage  resources  in  a  global  operating  system,  despite  inaccurate  and 
incomplete information within the operating system itself (resulting from communication delays) [Jensen 82].
As a consequence of this situation, the high degree of internal deterministic behavior assumed to be easily
achieved in classical centralized computers can be very expensive in distributed systems. Thus, decentralized
computers must necessarily be designed to deal with indeterminism explicitly, systematically, and to their best
advantage (transparently to their users).

This paradigm has led us to develop a new relational model of data consistency that allows one to reason
about  the  relationships  among  collections  of  processes,  data  objects,  and  state  variables  in  distributed  systems. 
Compared  with  other  approaches,  such  as  the  conventional  serialization  model,  our  model  provides  greater 
concurrency  in  many  interesting  cases,  is  free  from  synchronization-induced  deadlock  and  rollback,  and 
uniformly  accommodates  both  process  and  data  synchronization. 

Based  on  this  model,  we  have  created  a  new  approach  to  distributed  co-operating  processes,  and  the  concept 
of  co-operating  transactions  supporting  various  forms  of  decentralized  control  among  peers  (including  both 
indeterministic  and  deterministic  forms  of  interaction).  Using  our  model,  the  synchronization  of  distributed 
co-operating  processes  is  formulated  as  the  preservation  of  a  set  of  dependency  relationships  among  their  state 
variables.  In  the  interest  of  efficiency,  these  dependency  relationships  may  be  formulated  as  probabilistic 
whenever  the  application  permits.  Co-operating  transactions  are  co-operating  processes  whose  interactions 
are  made  atomic  for  the  sake  of  reliability.  Co-operating  transactions  cannot  be  implemented  using  the 
conventional  serialization  model  of  data  consistency  because  of  the  generality  of  the  communication  involved. 

We  begin  the  remainder  of  this  paper  by  introducing  our  relational  model  of  data  consistency,  followed  by 
a  description  of  co-operating  processes,  and  then  a  discussion  of  co-operating  transactions.  Some  of  these 
ideas  are  illustrated  by  examples  from  our  initial  experience  in  applying  them  to  the  Accent  [Rashid  81] 
network  operating  system  and  other  Spice  personal  computer  system  software  [Schaffer  82,  Ball  81].  These 
ideas  will  also  appear  in  the  ArchOS  operating  system  for  the  Archons  decentralized  computer. 

3.2.2  The  Relational  Model  of  Data  Consistency 

3.2.2.1 Our Objections to the Serialization Model

Most  of  the  work  on  synchronization  methods  for  distributed  systems  has  been  done  in  the  context  of 
distributed  database  systems,  and  is  based  on  the  serialization  model  of  data  consistency  [Bernstein  80].  The 
basic  concept  of  the  serialization  model  is  that  if  each  transaction  executing  alone  maintains  the  consistency  of 
the  data  objects,  then  executing  transactions  serially  and  in  any  order  of  execution  will  also  be  correct,  i.e., 
maintain  the  consistency  constraints.  Therefore,  a  set  of  sufficient  conditions  for  the  correct  concurrent 
execution  of  transactions  is  one  which  can  be  proven  equivalent  to  a  serial  order  of  execution.  One  well 
known  form  of  these  conditions  is  [Papadimitriou  77]: 

1. There exists a total ordering of the set of transactions.

2. For every pair of operations that conflict (i.e., at least one operation is a write), their precedence
relation on a shared data object must be identical to that of their corresponding transactions in the
total ordering of transactions.

Although the serialization model is very general, in the sense that the consistency constraints can be
preserved with knowledge of conflicts being the only semantic information about the transactions [Kung 79a],
it is inadequate with respect to the needs of distributed operating systems (especially those based on peer
relationships rather than client-server type relationships).

•  The  serialization  model  lacks  concurrency.  Kung  and  Papadimitriou  [Kung  79a]  show  that  it  uses 
only  syntactic  (and  conflict)  information  about  transactions,  and  that  it  is  possible  to  formulate 
more  efficient  non-serializable  transactions  by  using  information  about  data  objects  or  additional 
semantic information about transactions. For example, the work of Lamport [Lamport 76], Kung
and Lehman [Kung 79b], Schwarz and Spector [Schwarz 82], Garcia-Molina [Molina 83], and
Allchin and McKendry [Allchin 82] all further demonstrate this point. Concurrency is a critical
issue  in  operating  systems,  and  the  information  needed  to  improve  it  is  often  available  (neither  of 
which  may  be  as  much  the  case  at  the  applications  level,  e.g.,  in  database  systems). 

•  The serialization model suffers synchronization-induced deadlock and rollback problems [Bernstein
80].  Synchronization  methods  based  on  the  serialization  model  can  be  classified  into  two  basic 
approaches  -  two  phase  locking  and  time  stamps.  The  two  phase  lock  approach  can  lead  to 
deadlock,  while  the  time  stamp  approach  is  prone  to  problems  caused  by  rollback. 

•  The serialization model precludes a distributed (e.g., either decentralized or network) operating
system kernel from using atomic transactions for communication and co-operation [Lamport 76].
When a pair of transactions exchange messages in the course of an interaction, their operations
(i.e., the two way communications) might be interleaved so as to violate the relative ordering
condition (i.e., 2 above) required by the serialization model.

•  The serialization model does not support the synchronization of co-operating processes. Co-operating
processes must be permitted to change their states autonomously as long as they are not
in  those  states  that  are  governed  by  the  specified  rules  of  co-operation  (in  our  case,  the  set  of 
dependency  relations).  However,  the  serialization  model’s  conditions  hold  at  all  times,  turning  the 
power  of  its  generality  against  its  use  for  interprocess  co-operation. 
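
To make the two serialization conditions quoted earlier concrete, the following sketch tests whether a schedule admits a total ordering of transactions that agrees with the precedence of every pair of conflicting operations. It uses the standard precedence-graph test rather than any mechanism from this report; the schedule encoding and the name serial_order are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# One operation in a schedule: (transaction name, "r" or "w", data object name).
Operation = Tuple[str, str, str]


def serial_order(schedule: List[Operation]) -> Optional[List[str]]:
    """Return a total ordering of transactions consistent with all conflicts, or None."""
    txns = sorted({txn for txn, _, _ in schedule})
    successors: Dict[str, Set[str]] = defaultdict(set)
    for i, (t1, op1, obj1) in enumerate(schedule):
        for t2, op2, obj2 in schedule[i + 1:]:
            # Two operations conflict if they touch the same object and one is a write.
            if t1 != t2 and obj1 == obj2 and "w" in (op1, op2):
                successors[t1].add(t2)  # t1 must precede t2 in the total ordering
    indegree = {txn: 0 for txn in txns}
    for txn in txns:
        for nxt in successors[txn]:
            indegree[nxt] += 1
    ready = [txn for txn in txns if indegree[txn] == 0]
    order: List[str] = []
    while ready:  # topological sort of the precedence graph
        txn = ready.pop(0)
        order.append(txn)
        for nxt in sorted(successors[txn]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # A cycle leaves some transaction unplaced: no serial order exists.
    return order if len(order) == len(txns) else None
```

Under this encoding, a two-way message exchange (T1 writes m1, T2 reads m1, T2 writes m2, T1 reads m2) produces a cycle and therefore no serial order, which is the situation described in the third objection above.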


To  remedy  these  disabilities,  we  have  supplanted  the  serialization  model  with  our  own  model  based  on 
relationships  among  the  data  objects.  We  share  the  premise  that  each  transaction  executing  alone  preserves 
the  consistency  constraints  of  the  data  objects.  But  we  further  assume  that  the  relationships  affecting 
synchronization  among  the  data  objects  are  known.  This  seems  to  be  a  justifiable  assumption  in  our  context 
of  distributed  operating  systems. 

In database systems based on the serialization model, serializability is taken as the consistency constraint,
i.e., the correctness criterion. In several current efforts on non-serializable transactions, serializability is
viewed as a "strong form" of the correctness criteria needed by certain applications and not by
others [Schwarz 82, Molina 83, Allchin 82]. In our approach to the correctness issue, consistency constraints
are modeled as relations among data objects, and are partitioned into an application independent part called
data invariants and an application dependent part called action invariants. The execution of concurrent
processes or transactions is defined to be correct if it satisfies both the data and action invariants, independent
of whether the processes or transactions are serializable. This is because serializability is not a relation among
data objects and therefore not a consistency constraint. In our view, serializability is only a set of sufficient
conditions to maintain consistency constraints.

3.2.2.2 Classification of Relations

Our relational model of data consistency classifies the possible relationships among data objects as
autonomous, dependent, or partially dependent.

•  Autonomous: The relation is defined as the set of the cartesian products of the domains of the data
objects. From a synchronization point of view, the implication of an autonomous relationship is
that object A can take on any value that is in its domain, regardless of the value of B (i.e., A and B
can be updated separately).

An  autonomous  relation  will  be  called  probabilistic  if  a  joint  probability  distribution  is  defined 
upon  the  set  of  cartesian  products.  The  concept  of  probabilistic  relations  is  important  to  our 
discussion  in  the  section  on  co-operating  processes. 

•  Dependent: The relation is defined by a proper subset of the cartesian products of the domains of
the data objects. In this case, the value taken by a data object, A, is constrained by the value taken
by another data object, B, and vice versa. The implication of this type of relationship is that when
there are dependency relationships among data objects, these data objects can no longer be updated
independently.

•  Partially dependent: The relation is defined as a proper subset of the cartesian products of data
object domains, a part of which takes the form of cartesian products of subsets of the domains.

For example, if the domains of A and B are both {0, 1, 2} with the data invariant
"if A = 2, then A = B", then the partially dependent relation is the set consisting of the tuple <2,2>
concatenated with the set of cartesian products {0, 1} x {0, 1, 2}. The notion of partially dependent
relationships allows us to view process synchronization as the act of maintaining the data
invariants among distributed state variables. Suppose A and B are state variables of processes P1
and P2 respectively. We can interpret the example above as "process P2 must enter state two if
process P1 enters state two, otherwise processes P1 and P2 can change their states autonomously."
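
The three kinds of relation can be enumerated directly for the small example above. The sketch below is illustrative only; the helper name relation and the choice of "A = B" as the dependent example are assumptions.

```python
from itertools import product

DOMAIN = {0, 1, 2}


def relation(invariant, dom_a=DOMAIN, dom_b=DOMAIN):
    """All <A, B> tuples of the cartesian product that satisfy the data invariant."""
    return {(a, b) for a, b in product(dom_a, dom_b) if invariant(a, b)}


# Autonomous: no constraint, so the relation is the full cartesian product.
autonomous = relation(lambda a, b: True)

# Dependent: here the invariant "A = B" leaves a proper subset of the product.
dependent = relation(lambda a, b: a == b)

# Partially dependent: the invariant "if A = 2, then A = B" from the text.
partially_dependent = relation(lambda a, b: a != 2 or a == b)
```

The partially dependent relation contains seven of the nine possible tuples: <2,2> plus every tuple in {0, 1} x {0, 1, 2}.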

3.2.2.3 Definitions

We now proceed to make the following definitions.

•  Data objects: the user defined, smallest unit of data items that can be synchronized (e.g., locked).

•  Data invariants: the mathematical representation of the dependency relationships among data
objects (e.g., "A = B"). Data invariants must be preserved by all processes or transactions.

•  Atomic  data  sets:  user  defined  disjoint  sets  of  data  objects,  each  of.  which  is  constrained  by  a 
user-specified  set  of  data  invariants.  For  example,  one  set  has  data  objects  A  and  B  with  invariant 
"A  =  B",  and  another  set  has  data  objects  C  and  D  with  invariant  "C  >  D”.  Atomic  data  sets  are 
our  model  for  the  modular  decomposition  of  operating  system  data  objects. 
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•  Action  invariants:  a  proper  subset  of  data  invariants  that  arc  application  dependent.  i.e„  contin¬ 
gent  upon  the  actions  of  specific  transactions.  For  example,  the  requirement  that  the  sum  of  two 
bank  accounts  remain  unchanged  after  a  fund  transfer  transaction  between  them  is  modeled  as  the 
action  invariant  ’’either  the  credit  and  debit  must  both  be  done  or  neither  is  done”.  This  require¬ 
ment  must  hold  at  the  end  of  the  transaction,  but  need  not  hold  at  other  times;  Le.,  the  sum  of  the 
two  accounts  may  change  across  time. 

•  Conformity:  a  concurrent  access  to  shared  data  objects  which  preserves  all  of  the  data  invariants 
and  satisfies  ail  action  invariants.  Note  that  conformal  transactions  may  or  may  not  be  serializ¬ 
able. 

3.2.2.4 Representations of Data Objects and Data Invariants

Each data object is internally represented by triplets, <name, value, version number>. When a data object is
created, its initial value is assigned to version zero of this data object, such as "A[0] := 1".

When the data object is to be updated, a new version of the object is created and the transaction works on
this new version. For example, the code "A := A+1" in an update transaction corresponds to the following
steps:

    A[v+1] := A[v];          {v is the version number}
    A[v+1] := A[v+1] + 1;    {A := A+1}
    v := v+1;                {If the transaction commits}

If  this  transaction  successfully  commits,  the  new  version  becomes  permanent.  Old  versions  can  be  kept  in 
the  log  file  as  back-ups  or  discarded.  The  importance  of  this  representation  to  us  is  that  it  provides  a  concrete 
representation  of  the  data  invariants.  For  example,  the  data  invariant  "A=B"  could  be  represented  as 
A[v] = B[v], v = 0, 1, 2, 3, ... When a transaction updates the version number of one object in an atomic data
set,  it  then  updates  the  version  numbers  of  all  other  objects  in  that  set.  Since  data  invariants  are  defined  upon 
data  objects  with  identical  version  numbers,  a  version  of  an  atomic  data  set  exists  at  a  particular  time  if  and 
only  if  that  version  of  all  its  objects  exists  at  that  time. 
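
The versioned representation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ArchOS implementation: the class `VersionedObject` and the helpers `ads_version_exists` and `invariant_holds` are hypothetical names, and discarding the new version on abort is an assumption (the report only says that old versions may be kept in the log or discarded).

```python
class VersionedObject:
    """A data object stored as <name, value, version number> triplets."""

    def __init__(self, name, initial):
        self.name = name
        self.versions = {0: initial}  # e.g. A[0] := 1
        self.current = 0              # v, the latest committed version number

    def update(self, fn, commit=True):
        v = self.current
        self.versions[v + 1] = self.versions[v]          # A[v+1] := A[v]
        self.versions[v + 1] = fn(self.versions[v + 1])  # A[v+1] := A[v+1] + 1
        if commit:
            self.current = v + 1                         # v := v+1
        else:
            del self.versions[v + 1]  # assumption: an aborted version is discarded
        return self.current


def ads_version_exists(objects, v):
    """A version of an atomic data set exists iff that version of every object exists."""
    return all(v in obj.versions for obj in objects)


def invariant_holds(a, b):
    """Check the invariant A[v] = B[v] over versions that exist for both objects."""
    common = a.versions.keys() & b.versions.keys()
    return all(a.versions[v] == b.versions[v] for v in common)
```

For instance, after creating `a = VersionedObject("A", 1)` and `b = VersionedObject("B", 1)` and incrementing only `a`, version 1 of the set {A, B} does not yet exist; it exists once `b` has been incremented as well, at which point the invariant can be checked on version 1.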

3.2.2.5  Some  Important  Observations 

In  this  section  we  state  three  important  observations,  based  on  our  model,  that  are  relevant  to  our  later 
discussion.  These  observations  are  presented  here  in  an  intuitive  fashion,  but  will  appear  in  a  more  formal 
manner  in  Sha's  thesis  [Sha  83]. 

1. A sufficient condition for conformity: Conflicting transactions must be mutually exclusive with
respect to the version number of the shared atomic data set. That is, data objects belonging to the
same version can be shared by several read transactions, but they can only be modified by a single
update transaction. Under this condition and our first assumption, a transaction will preserve the
data invariants of each of the accessed atomic data sets. Mutual exclusion with respect to version
number can be obtained by using any appropriate synchronization method [Reed 79, Habermann
79, Thomas 79].

2. Concurrent updating of data objects: Data objects belonging to different atomic data sets can be
updated in any arbitrary order permitted by the action invariants of the updating transaction.
This is because there are no data invariants across the boundaries of atomic data sets. Within an
atomic data set, a transaction can only update data objects with the same version number (i.e., a
transaction can only operate on a particular atomic data set version). However, there can be N
concurrent updates on an atomic data set of N data objects. Mutual exclusion with respect to
version number means that transactions can concurrently update different data objects of the
same atomic data set, as long as these data objects are in different versions. For example, imagine
an atomic data set consisting of data objects A and B; transaction T1 works on A[1] (producing
A[2]), and then begins work on B[1]. This would permit a transaction T2 to begin work on A[2]
while T1 is still working on B[1].

To  support  N  concurrent  updates  of  an  atomic  data  set  with  N  data  objects,  two  copies  for  each  of 
the  N  data  objects  are  needed.  One  set  of  copies  is  used  for  the  atomic  data  set  checkpoint 
version,  while  another  set  is  used  to  store  the  most  recent  versions  of  the  data  objects.  As 
transactions  update  the  data  objects  in  the  atomic  data  set,  the  current  versions  of  the  data  objects 
could be different. Note that aborting an earlier transaction will lead to the cascaded abortion of
later transactions operating on data objects with later version numbers. The trade-off between
increased concurrency and the potential for cascaded aborts is an important design issue.
Assuming that all the transactions following the checkpoint version are kept in a recovery log, the system
can  always  recover  to  the  checkpoint  version  after  a  system  failure.  A  new  checkpoint  version  can 
be  made  whenever  the  current  versions  of  all  of  the  constituent  data  objects  in  the  atomic  data  set 
have  the  same  version  number.  However,  if  a  fixed  interval  between  checkpoints  is  desired,  then 
either  some  concurrency  in  the  updating  process  must  be  sacrificed,  or  additional  state  save 
operations  will  be  required.  In  summary,  a  small  amount  of  additional  storage  for  the  version 
numbers  makes  it  possible  to  have  both  better  concurrency  and  ease  of  recovery,  even  when 
cascaded  aborts  are  involved. 

3.  No  deadlock  or  rollback  problems  result  from  synchronisation.  In  our  relational  model,  each  of  the 
conflicting  transactions  will  obtain  a  unique  version  for  each  of  the  atomic  data  sets  accessed  by 
them.  Since  the  data  invariants  of  each  of  the  atomic  data  sets  can  be  satisfied  independent  of 
other  atomic  data  sets,  each  of  these  transactions  can  autonomously  produce  new  versions  of  the 
atomic  data  sets.  This  cannot  cause  deadlock,  because  the  generation  of  new  versions  makes  the 
atomic data sets available to other transactions. There is also no possibility that this
synchronization will produce a rollback, because there are no time stamps used to impose a global order in the
execution of transactions.
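
The checkpoint rule in the second observation (a new checkpoint version can be made only when all constituent data objects have the same current version number) can be illustrated with a short Python sketch. The function `can_checkpoint` and the dictionary `current` are hypothetical names chosen for this illustration; the sketch tracks only version numbers, not values, locks, or the recovery log.

```python
def can_checkpoint(current_versions):
    """A new checkpoint may be taken only when every object is at the same version."""
    return len(set(current_versions.values())) == 1


# Atomic data set {A, B}, both at version 1 (the checkpoint version).
current = {"A": 1, "B": 1}

current["A"] = 2                     # T1 finishes with A[1], producing A[2]
mid_t1 = can_checkpoint(current)     # A is at version 2 while B is still at version 1

current["B"] = 2                     # T1 finishes with B[1], producing B[2]
after_t1 = can_checkpoint(current)   # both objects are at version 2 again
```

While T1 is between its two updates, a second transaction may already start on A[2], but no checkpoint can be taken until B catches up.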

3.3  A  Modular  Approach  to  Non-serializable  Concurrency  Control: 
Database  Consistency,  Transaction  Correctness,  and  Schedule 
Optimality 


3.3.1  Introduction 

As part of the Archons decentralized computer system project, we are developing a decentralized operating
system with atomic transaction facilities embedded at the kernel level [Jensen 83]. The concurrency control of
the executions of transactions has been a very active area of research. A major development in this area is the
establishment of the serializability theory [Bernstein 79, Papadimitriou 77]. Since the performance of a
distributed computer system depends greatly on concurrency control, the desire to obtain a very high degree of
concurrency motivates many to investigate the use of non-serializable schedules.

From  a  programming  point  of  view,  a  transaction  programmer  has  two  duties.  First,  he  is  responsible  for 
the  consistency  and  the  correctness  of  his  transactions.  That  is,  transactions  must  preserve  the  consistency  of 
the  shared  data  objects  (database)  and  produce  results  as  specified  when  executing  alone.  In  the  following,  we 
assume  that  all  transactions  under  discussion  are  consistent  and  correct.  Second,  the  programmer  must 
schedule  his  written  transaction  according  to  some  scheduling  rules  implemented  by  locks  or  other 
mechanisms.  The  concurrency  control  mechanisms  embedded  in  the  transactions  allow  transactions  to  be 
executed  concurrently  but  in  such  a  way  that  the  consistency  of  the  database  and  the  correctness  of  each 
transaction  are  preserved. 

Kung and Papadimitriou [Kung 79a] showed that the degree of concurrency provided by any scheduling
rule is limited, and the bound is determined by the information used by the scheduling rule. They showed
that serializability theory provides the highest possible degree of concurrency, when only information about
the classically defined transaction syntax is used. Since serializability theory does not utilize the semantic
information  of  transactions,  most  of  the  recent  work  on  non-serializable  transactions  has  focused  on  the  use  of 
semantic  information  of  the  transaction  system  to  enhance  concurrency  [Lamport  76,  Schwarz  82,  Allchin 
82,  Lynch  83a,  Molina  83].  In  this  approach,  the  details  of  each  of  the  transactions  are  carefully  examined,  and 
a  permissible  interleaving  of  transaction  steps  is  then  specified  accordingly.  For  example,  in  [Lynch  83a] 
transactions  are  grouped  together  by  some  classification  scheme.  Permissible  interleavings  for  each  of  the 
given  groups  are  specified  by  a  corresponding  set  of  "break  points"  embedded  between  transaction  steps. 
Sets  of  break  points  can  be  organized  into  a  hierarchical  form.  Since  it  is  impossible  to  predict  the  semantics 
of  various  transactions  in  advance,  the  transaction  system  semantic  information  approach  emphasizes  the 
development  of  syntactic  structures  to  support  programmers’  specification  of  their  own  scheduling  rules.  The 
"break  point  specifications"  in  [Lynch  83a]  and  "lock  compatibility  tables"  in  [Schwarz  82]  are  two  examples. 
Schedules  consistent  with  user  specifications  are  defined  to  be  both  consistent  and  correct.  It  is  assumed  that 
programmers  understand  the  details  of  each  others’  transactions.  They  are  responsible  for  the  consistency  and 
correctness  of  their  own  scheduling  rules. 

The strength of this transaction system semantic information approach is that it allows programmers to
develop their own concurrency control rules that are tailored to their specific applications. This provides the
potential to obtain a very high degree of concurrency. On the other hand, this transaction system semantic
information approach does not seem to be suitable for a general transaction facility for two reasons: it neither
provides application independent scheduling rules nor addresses the issue of modularity.

An important contribution of the serializability theory is that it provides application independent
scheduling rules. As long as programmers follow a prescribed protocol such as the "two phase lock" [Eswaren 76],
the consistency and correctness of concurrency control is ensured. It should be noted that serializable
schedules allow individual transactions to be regarded as if they are executing alone. The consistency and
correctness of serializable schedules follows immediately from the consistency and correctness of individual
transactions. When schedules are non-serializable, transactions can no longer be regarded as if they are
executing  alone.  Proving  the  consistency  and  correctness  of  any  non-serializable  scheduling  rule  is  therefore 
necessary.  Hence,  it  is  very  desirable  to  have  a  non-serializable  concurrency  control  theory,  which  provides 
scheduling  rules  that  are  proven  to  be  both  consistent  and  correct.  Such  rules  would  release  programmers 
from  the  burden  of  inventing  their  own  scheduling  rules  and  then  proving  their  rules  to  be  consistent  and 
correct  in  each  case. 

Another difficulty in applying the transaction system semantic information approach to a general
transaction facility is that this approach does not address the issue of modularity. Since this is not a common topic
in the context of concurrency control, we begin with an example. Consider a database consisting of only two
variables A and B with consistency constraint "A + B = 100". Suppose that there are two "fund transfer"
transactions: T1 = {t11: A := A - 1; t12: B := B + 1} and T2 = {t21: B := B - 2; t22: A := A + 2}, where
tij denotes step j of transaction i. It is easy to verify that, in addition to the serializable schedules, the
non-serializable schedule {t11; t21; t22; t12} is also consistent and correct. That is, the consistency of the
database is preserved and each transaction correctly performs the "fund transfer" task when transactions are
executed according to this schedule. Observing this, one might suggest that the transfer transactions be
scheduled by means of putting a break point between the two steps in the transaction. Now suppose that we
implement transfer transaction T2 differently by changing the second step "t22: A := A + 2" to
"t22: A := 100 - B". When executing alone, the modified T2, like the original, preserves the consistency of the
database and transfers 2 units from B to A. In addition to performing the same function, both versions of T2
have two steps using the same commutative operators "add" and "subtract". One might suggest putting a break
point between steps t21 and t22 and scheduling the modified transaction system as before, {t11; t21; t22; t12}.
Unfortunately, this time the schedule always leaves the database inconsistent. For example, let both A and B
be 50 initially. Step t11 changes A from 50 to 49. Step t21 changes B from 50 to 48. Step t22 changes A from
49 to 52. At this point A + B = 100. The last step t12 adds one to B and leaves the sum of A and B equal to
101. The lesson is that the specification of the pre- and post-conditions of transactions is generally insufficient
for the specification of break points. To correctly specify permissible interleavings that utilize the semantic
information of a group of transactions, programmers must understand the interactions among the steps of all
transactions in the group. When the transactions are complex, written and modified by many different
programmers from time to time, such a task could quickly become unmanageable. In software engineering,
one of the basic principles for the development of a large scale system is to partition it into implementation
independent modules [Habermann 76]. The interleavings introduced by the transaction system semantic
information approach, however, could create implementation sensitive inter-dependence among transactions.
As suggested by Molina [Molina 83], it seems appropriate to view the transaction system semantic information
approach as a powerful tool to solve specific and static transaction problems that require a very high degree of
concurrency, analogous to the VLSI solutions to special computation problems.
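
The arithmetic of the fund transfer example can be replayed mechanically. The Python sketch below writes step j of transaction i as tij, starts from A = B = 50, and runs the same interleaved schedule against both implementations of T2. The helper names `run` and `step` and the dictionaries `T1`, `T2_original`, and `T2_modified` are illustrative, not part of the report.

```python
def step(var, expr):
    """Build a transaction step that assigns expr(state) to var."""
    def apply(state):
        state[var] = expr(state)
    return apply


def run(schedule, steps, a=50, b=50):
    """Execute the named steps in order on the two-variable database {A, B}."""
    state = {"A": a, "B": b}
    for name in schedule:
        steps[name](state)
    return state


T1 = {"t11": step("A", lambda s: s["A"] - 1),
      "t12": step("B", lambda s: s["B"] + 1)}
T2_original = {"t21": step("B", lambda s: s["B"] - 2),
               "t22": step("A", lambda s: s["A"] + 2)}
T2_modified = {"t21": step("B", lambda s: s["B"] - 2),
               "t22": step("A", lambda s: 100 - s["B"])}

interleaved = ["t11", "t21", "t22", "t12"]

ok = run(interleaved, {**T1, **T2_original})   # A + B stays at 100
bad = run(interleaved, {**T1, **T2_modified})  # A + B ends at 101
alone = run(["t21", "t22"], T2_modified)       # modified T2 alone is consistent
```

The modified T2 is indistinguishable from the original by its pre- and post-conditions when run alone, yet the same break point placement breaks the constraint once T1 is interleaved with it.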

Distributed  operating  systems  are  known  to  be  very  complex,  written  and  modified  by  many  different 
programmers  over  a  period  of  years.  Any  non-modular  approach  to  concurrency  control  is  likely  to  be 
unmanageable.  Given  the  difficulty  of  guaranteeing  the  consistency  and  correctness  of  schedules  resulting 
from  applying  the  transaction  system  semantic  information  approach  to  a  general  transaction  facility,  it  seems 
important  to  develop  a  new  non-serializable  approach.  This  new  approach  should  provide  scheduling  rules 
with  the  following  properties: 

1.  These  rules  generate  only  consistent  and  correct  schedules; 

2. These rules are modular in the sense that they permit one to write, modify and schedule one's
transaction independently, knowing only that other transactions will be consistent, correct and
written in the given syntax.

Our  approach  begins  by  observing  the  three  types  of  information  available  to  scheduling  rules  under  the 
above  requirements.  The  first  type  is  the  information  about  the  consistency  constraints  of  the  database. 
Programmers  are  informed  about  the  consistency  constraints,  and  they  are  responsible  for  the  preservation  of 
the  database  consistency.  The  second  type  is  the  syntactic  definition  of  transactions.  Transactions  must  be 
written  in  the  given  syntax.  The  third  type  is  the  semantic  information  of  one’s  own  transaction.  In  short,  we 
assume  that  when  a  programmer  is  ready  to  schedule  his  transaction,  he  knows  the  consistency  constraints  of 
the  database,  the  details  of  his  transaction  which  he  has  just  written  or  modified  and  the  syntax  of  the 
transactions  that  everyone  must  follow.  He  makes  no  assumption  about  others'  transactions  except  that  they 
are  consistent,  correct  and  written  in  the  given  syntax. 

In  this  paper,  our  task  is  twofold.  The  first  is  to  develop  new  syntactic  structures  that  can  be  used  to  enhance 
concurrency.  The  second  is  to  identify  new  scheduling  rules  that  make  the  best  use  of  available  information. 
It turns out that for a given set of primitive steps the best we can do in the development of modular
scheduling rules is to decompose both the database and transactions into consistency preserving units of data
objects and transaction steps respectively. When transactions and the database are decomposed into such
smaller "consistency preserving" units, highly concurrent schedules can then be developed. We would like to
mention that these smaller disjoint "consistency preserving" units also facilitate failure recovery, although this
topic is outside the scope of this paper and will be pursued elsewhere.

This paper is organized as follows. We first develop the notion of a consistency preserving partition of the
database.  We  then  develop  our  model  in  detail  for  the  classical  single  level  transactions.  Next,  we  extend  our 
results  to  nested  transactions  and  then  to  compound  transactions.  Finally,  we  investigate  the  optimality  of 
modular  and  application  independent  scheduling  rules. 

3.3.2  A  Model  of  Operating  System  Database 

There  is  a  general  consensus  that  an  operating  system  should  be  built  in  a  modular  fashion.  A  typical 
module,  such  as  the  monitor  [Hoare  74]  commonly  used  in  centralized  operating  systems,  consists  of  a  set  of 
shared  data  objects  and  a  set  of  pre-defined  procedures  that  facilitate  the  manipulation  of  these  shared  data 
objects.  In  addition,  there  is  a  simple  scheduler  embedded  in  the  module  to  ensure  that  users  access  this  set  of 
shared  data  objects  in  a  strictly  serial  fashion  via  some  mutual  exclusion  mechanism.  However,  there  is  little 
agreement  on  what  constitutes  the  basis  of  a  module  when  shared  data  objects  are  distributed  across  nodes  in 
a  distributed  computer  system.  Our  approach  to  this  problem  focuses  on  the  consistency  constraints  among 
system  data  objects.  Given  the  consistency  constraints  of  the  system  database,  we  show  that  the  database  can 
be  partitioned  into  disjoint  sets  of  data  objects  called  atomic  data  sets  (ADS).  Such  a  partition  is  consistency 
preserving in the sense that the consistency of each ADS can be maintained independently, and the
conjunction of the consistency constraints of atomic data sets is equivalent to the consistency constraints of the entire
database. It will also be shown that there always exists a unique maximal consistency preserving partition.

From  an  application  point  of  view,  atomic  data  sets  can  be  used  as  a  basis  for  constructing  distributed 
software modules: modules that encapsulate distributed data objects. For example, one can easily generalize
the  monitor  (or  abstract  data  type)  approach  developed  for  centralized  systems  as  follows.  We  can  define  a  set 
of  primitive  procedures  for  each  of  the  data  objects  in  an  atomic  data  set  to  facilitate  the  manipulation  of  these 
data  objects.  When  the  scheduling  rules  grant  one  the  privilege,  one  is  entitled  to  use  those  pre-defined 
procedures  as  building  blocks  of  his  own  transaction.  Before  the  development  of  the  formalism,  we  would, 
however,  like  to  first  present  an  example  to  illustrate  the  concepts  of  the  consistency  preserving  partition  of 
the  database.  We  also  use  this  example  to  provide  an  intuitive  discussion  of  some  issues  related  to  the  design 
of  consistency  constraints  of  a  distributed  system.  The  investigation  of  the  principles  of  designing  consistency 
constraints  for  a  distributed  operating  system  to  enhance  system  performance  is  likely  to  become  an  important 
area  of  research. 


3.3.2.1 Consistency Preserving Partition of Database: An Example

The very nature of a distributed system provides us with both the opportunity of realizing a very high
degree of parallelism and the difficulty of coping with large communication delays. In order to maximize the
benefit of parallelism and to minimize the performance penalty caused by communication delay, it is
often useful to consider the use of consistency constraints that are weaker than the corresponding ones in
centralized systems. We illustrate this idea by a simplified case of managing the directories of a file system.

We  consider  a  set  of  shared  system  files  distributed  at  different  nodes.  There  is  a  local  directory  (LD)  at 
each  node  indicating  the  resident  files.  With  only  these  local  directories,  one  must  potentially  search  through 
all  the  LD's  in  order  to  locate  a  file,  and  this  would  be  very  inefficient.  To  increase  efficiency,  the  system  has 
a  global  directory  (GD).  The  GD  indicates  which  LD  should  be  searched  for  each  of  the  shared  files.  The 
GD  is  replicated  for  reliability  and  performance.  When  one  needs  a  file,  the  local  operating  system  kernel 
will first search through its LD, and then it will search a nearby GD if the file is not in its LD. The
introduction  of  GD's  facilitates  file  look-ups,  but  in  a  large  system  the  GD’s  can  become  a  performance 
bottle-neck.  To  further  improve  the  efficiency,  the  local  operating  system  kernel  at  each  node  constructs  a 
partial  global  directory  (PGD)  which  indicates  the  resident  nodes  of  the  frequently  used  remote  files. 

Although  GD’s  and  PGD's  help  in  locating  files,  they  also  make  the  updating  process  more  complicated. 
One could define the set of consistency constraints of all the GD's, PGD's and LD's as the requirement that
they  must  always  point  to  the  correct  locations  with  respect  to  any  reference.  This  implies  that  when  one 
moves  a  file  from  one  machine  to  another,  the  updating  of  the  source  and  destination  LD’s,  the  GD's  and  all 
the  relevant  PGD's  must  appear  as  an  instantaneous  event  with  respect  to  other  transactions.  This  can  be 
accomplished  by  following  the  two  phase  lock  protocol  [Eswaren  76]  to  lock  the  source  and  destination  LD’s, 
all  the  GD’s  and  all  PGD’s  that  contain  an  entry  indicating  the  transferred  file.  However,  this  approach  has  a 
serious  drawback  in  performance,  because  the  two  phase  lock  requires  that  no  lock  can  be  released  until  all 
the  locks  have  been  obtained.  In  short,  the  entire  system's  file  look-up  activities  might  be  forced  to  or  near  a 
halt  by  a  few  file  transfer  operations.  Therefore,  it  seems  reasonable  to  seek  an  alternative  approach.  One 
simple  alternative  which  permits  a  higher  degree  of  concurrency  is  to  use  "recent"  historical  locations  in  lieu 
of the current ones. Such a tactic is quite common in distributed systems and is modelled as follows.
GD's and PGD's can point to any valid LD location, i.e., the relationship among GD's, PGD's and LD's is in
the form of cartesian products. In addition, we have the following two performance enhancement schemes.
First, the GD will be updated whenever an LD is updated. Second, PGD's are managed by a "fault driven" policy.
When a transaction uses a PGD, it will increment the "success-counter" or the "failure-counter" associated
with the PGD according to the result from using its information. The local operating system kernel will
periodically compute the percentage of reference failures. Should it exceed a threshold, the entries in the
PGD will be updated by using information in the GD's. Obsolete forwarding addresses will also be deleted in
this updating process. The simplest scheme for a transaction directed to the "wrong" LD is to abort and try
later. There are more sophisticated schemes which can enhance performance. For example, one is to require
the transfer transaction to leave a forwarding address in the PGD of the source node. From a programming
point of view, performance enhancement schemes are transaction specifications that can be implemented by
asynchronous processes. Conceptually, consistency constraints define the set of consistent states.
Nevertheless, from an application point of view certain consistent states are considered to be more favorable than
others. Performance enhancement schemes are designed to increase the probability of staying in the most
favorable consistent states.
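
The "fault driven" PGD policy described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `PartialGlobalDirectory`, its method names, the default threshold value, the representation of a GD as a dictionary, and the resetting of the counters after a refresh are all hypothetical and are not specified in the report.

```python
class PartialGlobalDirectory:
    """Per-node table of remote file locations, refreshed when hints fail too often."""

    def __init__(self, threshold=0.25):
        self.entries = {}           # file name -> node believed to hold the file
        self.successes = 0          # the "success-counter"
        self.failures = 0           # the "failure-counter"
        self.threshold = threshold  # assumed tunable fraction of failed references

    def record(self, found):
        """Called by a transaction after it has used a PGD entry."""
        if found:
            self.successes += 1
        else:
            self.failures += 1

    def failure_rate(self):
        total = self.successes + self.failures
        return self.failures / total if total else 0.0

    def periodic_check(self, global_directory):
        """Run by the local kernel: refresh from a GD if failures exceed the threshold."""
        if self.failure_rate() <= self.threshold:
            return False
        for name in list(self.entries):
            if name in global_directory:
                self.entries[name] = global_directory[name]
            else:
                del self.entries[name]       # drop entries the GD no longer lists
        self.successes = self.failures = 0   # assumption: counters restart after a refresh
        return True
```

Because the refresh runs periodically rather than inside any file transfer transaction, it behaves like one of the asynchronous processes mentioned above: a stale PGD entry is still a consistent state, only a less favorable one.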

Generally  speaking,  weaker  consistency  constraints  permit  a  higher  degree  of  concurrency.  However,  once 
the  consistency  constraints  are  weakened,  the  complexity  of  transactions  will  be  increased  for  two  reasons. 
First,  the  process  of  weakening  the  consistency  constraints  enlarges  the  number  of  system  states  that  are 
considered  to  be  consistent.  For  example,  if  the  set  of  consistency  constraints  regarding  GD  and  PGD  is 
relaxed,  a  transaction  must  be  written  to  function  correctly  in  the  case  that  GD  or  PGD  will  only  give  a  valid 
LD  location  but  not  necessarily  the  LD  location  where  the  file  actually  resides.  That  is,  a  transaction  must  be 
able  to  abort  when  the  file  cannot  be  found  or  traced.  Transactions  must  have  the  ability  to  deal  with  all  the 
possible  system  states  that  are  consistent.  Second,  strong  consistency  constraints  generally  ensure  that  the 
system  will  stay  in  a  small  set  of  favorable  states,  although  enforcement  could  be  too  expensive.  When  the  set 
of permissible states is enlarged, transactions must generally be redesigned to better keep the system in favorable
states.  This  also  increases  the  complexity  of  transactions.  The  evaluation  of  the  trade-offs  between  system 
concurrency  and  transaction  complexity  is  an  exciting  new  research  area.  However,  in  this  paper  we  will  not 
analyze  the  performance  trade-offs,  but  rather  focus  on  the  notions  of  database  consistency,  transaction 
correctness,  and  schedule  optimality. 

3.3.2.2 Data Objects, Database and Consistency

A data object, O, is a user defined smallest unit of data which is individually accessible and upon which
synchronization can be performed (e.g., locking). Associated with each data object O, we have a set Dom(O),
the domain of O, consisting of all possible values taken by O. The granularity of a data object is not important
to the discussion of consistency and correctness. For example, a local directory can be designated as a data
object. Alternatively, each entry in this directory can be designated to be a data object and the directory can
be considered to be a collection of data objects so as to permit concurrent operations on the directory.

Each  data  object  is  internally  represented  by  triplets,  <name,  value,  version  numbcrX  When  a  data  object  is 
created,  its  initial  value  is  assigned  to  version  zero  of  this  data  object,  e.g.  ”A(0):  =  1”.  When  the  data  object  is 
updated,  a  new  version  of  the  object  is  created  and  the  transaction  works  on  this  new  version.  The  version 
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number  will  be  incremented  when  the  update  is  completed.  For  example,  the  step  "A  :=  A+ 1"  in  an 
transaction  corresponds  to  the  following: 

A[v+1]  :*  A[v];  {where  v  denotes  the  current  version} 

A[v+1]  :a  A[v+1]  +  1; 
v  :*  v+1; 

In the following discussion, when we refer only to the current value of a data object O, we write "O"
instead of "O[v]". The version number representation will be used only if different versions of the values of a
data object are referred to.
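The versioning rule can be illustrated with a minimal Python sketch. The class name `VersionedObject` and its methods are hypothetical and are not part of ArchOS; the sketch only mirrors the three pseudocode statements above for a single object.

```python
class VersionedObject:
    """A data object kept as <name, value, version number> (hypothetical helper)."""

    def __init__(self, name, initial_value):
        self.name = name
        self.versions = [initial_value]  # versions[0] is the initial value, e.g. A[0] := 1
        self.v = 0                       # current version number

    def current(self):
        # "O" in the text is shorthand for O[v]
        return self.versions[self.v]

    def update(self, f):
        # A[v+1] := A[v]
        self.versions.append(self.versions[self.v])
        # A[v+1] := f(A[v+1])
        self.versions[self.v + 1] = f(self.versions[self.v + 1])
        # v := v+1, performed only once the update is complete
        self.v += 1


a = VersionedObject("A", 1)
a.update(lambda x: x + 1)  # the step "A := A + 1"
```

Keeping the old versions in a list is a choice made for the sketch only; the text does not say how long earlier versions are retained.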

The system database $D = \{O_1, O_2, \ldots, O_n\}$ is the collection of all the shared data objects in the system. A
state of D is an n-tuple $Y \in \Omega = \prod_{i=1}^{n} \mathrm{Dom}(O_i)$. Associated with D, there is a set of consistency constraints
in the form of predicates on the states of D. A consistent state of D is an n-tuple, Y, satisfying this set of
consistency constraints. This is indicated by "C(Y) = 1", where C is a Boolean function indicating if this set of
consistency constraints is satisfied by Y. For simplicity, we will also refer to this set of consistency constraints
by C. The meaning of C is easily determined by the context. The set of all consistent states of D is denoted by
U, where $U = \{Y \mid C(Y) = 1\}$.
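For a small finite database the sets defined above can be enumerated directly. The following Python sketch assumes a toy database of two objects with domains {0, 1, 2} and an invented constraint "O_1 <= O_2"; neither comes from the report.

```python
from itertools import product

# Dom(O_1) and Dom(O_2): assumed toy domains
domains = [range(3), range(3)]

def C(state):
    # Assumed toy consistency constraint: O_1 <= O_2
    return state[0] <= state[1]

omega = list(product(*domains))   # all states of D (the product of the domains)
U = {y for y in omega if C(y)}    # the consistent states, U = {Y | C(Y) = 1}
```

Here `omega` has nine states, of which six satisfy the constraint.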

To illustrate these ideas, we return to the example of Section 3.2.2.1. Suppose that our operating system
database consists of only these directory objects GD's, PGD's and LD's. A state of the database forms a
description of the locations of system files.

3.3.2.3  Consistency Preserving Partition of Database

Once we decide upon the set of consistency constraints, we want to determine the concurrency permitted.
The concurrency permitted by a given set of consistency constraints is determined by partitioning the
operating system database into disjoint sets called atomic data sets (ADS), whose consistency can be maintained
independent of each other. Each atomic data set has its own consistency constraints, and the conjunction of
all the ADS consistency constraints is equivalent to the consistency constraints of the database. In the
following, we first develop the notion of a consistency preserving partition and show that there is always a
unique maximal partition with respect to a given set of consistency constraints.

Let $I = \{1, 2, \ldots, n\}$ be the index set of D. The index $i \in I$ specifies the data object $O_i \in D$. Let $\pi_S(Y)$
denote the projection of an n-tuple $Y \in \Omega$ using the set of indices $S \subseteq I$. That is, $\pi_S(Y)$ denotes the tuple
whose elements are the values of the data objects indexed by S. Let $P = \{S_1, \ldots, S_k\}$ denote a partition of I.
Let $V_i$ be the set whose elements are the projections of all the consistent states, $X \in U$, onto an arbitrary index
set $S_i$, i.e. $V_i = \bigcup_{X \in U} \{\pi_{S_i}(X)\}$.

Definition: A partition of the index set I, $P = \{S_1, S_2, \ldots, S_k\}$, is said to be consistency preserving (CP) if
and only if,

$\forall (Y \in \Omega) \{\, [\pi_{S_i}(Y) \in V_i,\ i = 1 \text{ to } k] \rightarrow [Y \in U] \,\}$.

An atomic data set $\mathcal{A}_i$ for a CP partition P is the set of data objects specified by $S_i \in P$. The associated
partition of data objects in D, Q, is called a consistency preserving partition of D.
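The definition of a CP partition can be checked by brute force when the domains are small. The sketch below is illustrative only: the function names, the three-object database and the constraint "O_1 <= O_2" are assumptions, and indices are 0-based rather than 1-based as in the text.

```python
from itertools import product

def project(state, S):
    """pi_S(state): the values of the objects indexed by S (0-based indices)."""
    return tuple(state[i] for i in sorted(S))

def is_cp(partition, domains, C):
    """True if every state whose projections all occur in some consistent state is itself consistent."""
    omega = list(product(*domains))
    U = [y for y in omega if C(y)]
    # V[k] holds the projections of all consistent states onto block k
    V = [{project(x, S) for x in U} for S in partition]
    return all(C(y) for y in omega
               if all(project(y, S) in V[k] for k, S in enumerate(partition)))

domains = [range(3)] * 3
C = lambda y: y[0] <= y[1]   # assumed constraint: couples O_1 and O_2, leaves O_3 free

coarse_ok = is_cp([{0, 1}, {2}], domains, C)    # keeps the coupled objects together
too_fine = is_cp([{0}, {1}, {2}], domains, C)   # separates the coupled objects
```

In this example the partition {{0, 1}, {2}} is CP, while the fully refined partition is not, because the state (2, 0, 0) has every single-object projection in some consistent state and yet violates the constraint.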

The definition of a CP partition states that any choice of consistent states of the atomic data sets leads to a
consistent state of the database. The following theorem shows that the consistency constraints of the
database can be decomposed into sets of ADS consistency constraints.

Let P be a CP partition of I and let $S_i \subseteq I$ be the set of indices which specify the data objects in the atomic
data set $\mathcal{A}_i \subseteq D$. Let the set of all the consistent states of an ADS $\mathcal{A}_i$ be $U_i = \bigcup_{X \in U} \{\pi_{S_i}(X)\}$.

Definition: The set of consistency constraints $C_i$ whose truth set is the consistent states of $\mathcal{A}_i$ is called the
ADS consistency constraints of $\mathcal{A}_i$, i.e.

$U_i = \{\pi_{S_i}(Y) \mid C_i(\pi_{S_i}(Y)) = 1\}$

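Theorem 1: The conjunction of all the ADS constraints $C_i$, i = 1 to k, is equivalent to the consistency
constraints C of D. That is,

$C = C_1 \wedge C_2 \wedge \cdots \wedge C_k$.

Proof: Let U* be the truth set of the conjunction of all the ADS constraints. We have,

$U^* = \{Y \mid C_i(\pi_{S_i}(Y)) = 1,\ i = 1 \text{ to } k\}$

$\quad\ = \{Y \mid \pi_{S_i}(Y) \in U_i,\ i = 1 \text{ to } k\} = U$,

where the last equality holds because P is CP, so every such Y is in U, and because every state in U projects
into each $U_i$. Hence, $C = C_1 \wedge C_2 \wedge \cdots \wedge C_k$. □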


CP partitions exist, since the trivial partition { I } is CP. Furthermore, the CP partitions are partially
ordered by refinement. That is, for any CP partitions $P_1$, $P_2$: $P_1$ is refined by $P_2$ if and only if
$\forall (S_j \in P_2)\ \exists (S_i \in P_1)\ (S_j \subseteq S_i)$. A maximal CP partition is one which is refined by no other CP partition.
In the following, we prove that there exists a unique maximal CP partition P.

The proof is based on three Lemmas. The idea of Lemma 2-1 is illustrated by the following example. Let
$P_1 = \{\{1, 2\}, \{3\}\}$ be a CP partition of the index set I = {1, 2, 3}. Suppose that $A = (a_1, a_2, a_3)$ and
$B = (b_1, b_2, b_3)$ are two consistent states. Let S be a partition set, either {1, 2} or {3}. Lemma 2-1 states that the
two new states which result from swapping the projections of A and B specified by S are also consistent. That
is, $(a_1, a_2, b_3)$ and $(b_1, b_2, a_3)$ are consistent.

Define a mapping $H_S: \Omega \times \Omega \rightarrow \Omega$ as follows, where $S \subseteq I$. Given that $X_1, X_2 \in \Omega$, $H_S(X_1, X_2) = Y$, where
Y satisfies $\pi_S(Y) = \pi_S(X_2)$ and $\pi_{S^c}(Y) = \pi_{S^c}(X_1)$, where $S^c = I - S$. Thus $H_S(X_1, X_2)$ replaces the
projections of $X_1$ specified by S with the projections of $X_2$ specified by S.

Lemma 2-1: Suppose that $X_1, X_2 \in U$. If S is an element of any CP partition of I, then $H_S: U \times U \rightarrow U$.

Proof: If S = I, then $H_S(X_1, X_2) = X_2$ and the result follows.

Let $P = \{S, \sigma_1, \ldots, \sigma_k\}$, $k \geq 1$, be a CP partition.

Define $W_0 = X_2$, $W_i = X_1$, i = 1 to k; so that $W_i \in U$, i = 0 to k. Then

$\pi_S(W_0) = \pi_S(H_S(X_1, X_2))$,

$\pi_{\sigma_i}(W_i) = \pi_{\sigma_i}(H_S(X_1, X_2))$, i = 1 to k.

Given that P is CP, $H_S(X_1, X_2)$ is therefore in U by the definition of a CP partition. Thus $H_S$ maps pairs of
consistent states into a consistent state. □
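The swapping map is easy to experiment with. The following Python sketch uses an assumed three-object database with the invented constraint "O_1 <= O_2" and 0-based indices; it checks the closure property of Lemma 2-1 for a block of a CP partition and shows that it can fail for an index set that is not such a block.

```python
from itertools import product

def H(S, x1, x2):
    """H_S(x1, x2): a copy of x1 whose components indexed by S are taken from x2."""
    return tuple(x2[i] if i in S else x1[i] for i in range(len(x1)))

domains = [range(3)] * 3
C = lambda y: y[0] <= y[1]                   # assumed toy constraint
U = [y for y in product(*domains) if C(y)]   # consistent states

# {0, 1} is a block of the CP partition {{0, 1}, {2}}: swapping stays inside U
closed_on_block = all(C(H({0, 1}, a, b)) for a in U for b in U)

# {0} alone is not a block of any CP partition here: swapping can leave U
closed_on_non_block = all(C(H({0}, a, b)) for a in U for b in U)
```

For instance, swapping index 0 between (0, 0, 0) and (2, 2, 0) yields (2, 0, 0), which is inconsistent.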

When we have two or more distinct CP partitions of the same index set I, partition sets from distinct
partitions could intersect. Lemma 2-2 generalizes Lemma 2-1 by allowing the intersections to be used for the
specification of swapping. For example, let $P_2 = \{\{1\}, \{2, 3\}\}$ be a second CP partition. The intersection of
$\{1, 2\} \in P_1$ and $\{2, 3\} \in P_2$ is {2}. Lemma 2-2 states that the two states resulting from swapping the
projections of A and B specified by {2} are also consistent. That is, $(a_1, b_2, a_3)$ and $(b_1, a_2, b_3)$ are consistent.
We show the consistency of $(a_1, b_2, a_3)$ as follows. First, we use Lemma 2-1 to swap the projections of A and
B specified by {1} of $P_2$; $E = (a_1, b_2, b_3)$ is one of the two resulting consistent states. Next, swapping the
projections of A and E specified by {3} of $P_1$, we find $(a_1, b_2, a_3)$ to also be a consistent state. We now give a
general proof of Lemma 2-2.

Lemma 2-2: Suppose that $S \in P_1$ and $\sigma \in P_2$, where $P_1$ and $P_2$ are CP. Then $H_{S \cap \sigma}: U \times U \rightarrow U$.

Proof: If $S \in P_1$, then there exists a CP partition P such that $S^c \in P$. Since $X_1, X_2 \in U$, it follows that
$H_{S^c}(X_2, X_1) \in U$ by Lemma 2-1. Therefore, $H_\sigma(X_1, H_{S^c}(X_2, X_1)) \in U$ as well. The Lemma follows, since
$H_\sigma(X_1, H_{S^c}(X_2, X_1)) = H_{S \cap \sigma}(X_1, X_2)$. □

Lemma 2-3 demonstrates that the least common refinement of any two CP partitions is also a CP partition.
For example, the least common refinement of $P_1$ and $P_2$, {{1}, {2}, {3}}, is also CP. In this case, given a set
of consistent states such as $(a_1, a_2, a_3)$, $(b_1, b_2, b_3)$ and $(c_1, c_2, c_3)$, we must prove that $(a_1, b_2, c_3)$ is also
consistent. First, we apply the intersection of {1} and {1, 2} to "A, B" and "A, C" respectively. $(a_1, b_2, b_3)$
and $(a_1, c_2, c_3)$ are two of the four new consistent states. Next, we apply the intersection of {2, 3} and {3} to
these two new states. One of the two resulting consistent states is $(a_1, b_2, c_3)$. We now give a general proof of
Lemma 2-3.

Lemma 2-3: If $P_1$ and $P_2$ are CP, then their least common refinement is also CP.

Proof: Let $P_1 = \{S_1, \ldots, S_m\}$ and $P_2 = \{\sigma_1, \ldots, \sigma_q\}$. Their least common refinement is $P_1 \cap P_2 =
\{C_1, \ldots, C_L\}$, where $C_i = S_j \cap \sigma_k$ for some j, k, i = 1 to L.

Let $X_i \in U$, i = 1 to L and $Y \in \Omega$ be given such that $\pi_{C_i}(X_i) = \pi_{C_i}(Y)$, i = 1 to L. We must prove $Y \in U$
to conclude that $P_1 \cap P_2$ is CP.

Define a sequence $\{X_i^*,\ i = 1 \text{ to } L\}$ as follows: $X_1^* = X_1$, $X_j^* = H_{C_j}(X_{j-1}^*, X_j)$, j = 2 to L. Noting that
$C_j = S_i \cap \sigma_k$, Lemma 2-2 indicates that $X_j^* \in U$, j = 1 to L. It follows that $X_L^* = Y \in U$. □

Theorem  2:  There  exists  a  unique  maximal  CP  partition. 

Proof:  Suppose  that  there  exists  more  than  one  maximal  CP  partition.  The  least  common  refinement  of 
distinct  maximal  CP  partitions  is  CP  by  Lemma  2-3,  thus  contradicting  the  maximality  assumption.  □ 

Corollary  2:  There  exists  a  unique  maximum  CP  partition  Q  of  D. 
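For a very small database, the unique maximal CP partition can be found by exhaustive search, using the fact from Lemma 2-3 that common refinements of CP partitions remain CP. The Python sketch below is an illustration under assumed data (four binary objects with the invented constraint "O_1 <= O_2 and O_3 = O_4", 0-based indices); the helper names are hypothetical and the search is exponential, so it is not a practical algorithm.

```python
from functools import reduce
from itertools import product

def project(state, S):
    return tuple(state[i] for i in sorted(S))

def is_cp(partition, omega, C):
    U = [y for y in omega if C(y)]
    V = [{project(x, S) for x in U} for S in partition]
    return all(C(y) for y in omega
               if all(project(y, S) in V[k] for k, S in enumerate(partition)))

def partitions(items):
    """Yield every set partition of a list, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        yield [[first]] + p                              # `first` in its own block
        for k in range(len(p)):                          # or joined to an existing block
            yield p[:k] + [[first] + p[k]] + p[k + 1:]

def common_refinement(p1, p2):
    """All non-empty pairwise intersections of blocks of p1 and p2."""
    return [sorted(set(a) & set(b)) for a in p1 for b in p2 if set(a) & set(b)]

omega = list(product(range(2), repeat=4))
C = lambda y: y[0] <= y[1] and y[2] == y[3]   # assumed toy constraints

cp = [p for p in partitions([0, 1, 2, 3]) if is_cp(p, omega, C)]
maximal = sorted(reduce(common_refinement, cp))
```

With these constraints only two of the fifteen partitions are CP, the trivial one and {{0, 1}, {2, 3}}, and their common refinement is the latter.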

Theorem  2  indicates  that  there  is  a  CP  partition  that  is  "most  refined"  with  respect  to  a  given  set  of 
consistency constraints. This partition will allow the maximal concurrency of transactions, although any CP
partition  can  be  used.  In  the  directory  example  discussed  earlier,  a  consistency  preserving  partition  of  the 
directories  could  be  as  follows: 

•  Each GD is an atomic data set, with the ADS consistency constraint that each entry points to a
valid node location.

•  Each PGD is an atomic data set, with the ADS consistency constraint that each entry points to a
valid node location.

•  All the LD's are placed in one single atomic data set with the consistency constraint that any file
name appears in one and only one LD where the file resides.

Note that in this formulation both PGD's and GD's are only required to point to any valid node location.
From a performance enhancement point of view, they are treated differently. GD's are required to be
updated whenever a file is moved, whereas a PGD is updated only when the percentile of reference failures
exceeds a threshold. Note that the required frequencies for updating GD's and PGD's respectively are
performance enhancement schemes designed to help GD's and PGD's in pointing to relatively "recent
historical locations". Scheduling rules will not take performance enhancement schemes into consideration.
Conceptually, system consistency constraints are the laws defining the legal states of the system database.
Scheduling rules ensure that these laws are observed by the concurrent executions. Performance
enhancement requirements are used to help the system to stay in the most favorable legal states. The implementation
of performance enhancement requirements is in the form of asynchronous processes and not a part of
concurrency control.

In the following section, we will develop the notion of transaction systems and show how consistency
preserving partitions can be used to schedule transactions in a non-serializable fashion.

3.3.3  A  Model  for  Transaction  Systems 

A  transaction  system  is  a  set  of  transactions  that  share  a  common  database.  Given  a  transaction  system,  a 
modular scheduling rule independently partitions the steps of each transaction into equivalence classes called
atomic  step  segments.  Having  partitioned  a  transaction,  one  can  use  "locks",  "time  stamps"  or  other  protocols 
to  ensure  that  the  atomic  step  segments  specified  by  the  rule  will  be  executed  serializably.  For  example,  the 
serializability  theory  is  a  special  modular  scheduling  rule  which  considers  each  transaction  in  the  system  to  be 
a  single  atomic  step  segment.  A  formal  model  of  modular  scheduling  rules  and  their  properties  will  be 
presented  in  section  3.2.4,  in  which  we  show  that  the  setwise  serializable  scheduling  rule  is  optimal  in  the  set 
of  all  the  application  independent  scheduling  rules  and  generalized  setwise  serializable  scheduling  rules  form 
a  complete  class  within  the  set  of  all  the  modular  scheduling  rules.  In  this  section,  we  focus  on  the  consistency 
and  correctness  of  the  schedules  associated  with  these  two  rules. 

A schedule z is said to be consistent if the execution of the transaction system according to z preserves the
consistency of the database. A schedule z is also said to be correct if the execution leads to the satisfaction of the
post-condition of each of the transactions. It is important to point out that the concept of correctness applies
only to the relationship between the inputs and outputs of each individual transaction, not the aggregate effect
of executing a set of transactions. The aggregate effect is dealt with through the notion of database
consistency. Suppose that we have a transaction withdrawing $5.00, a transaction depositing $10.00 and an account
with current balance equal to zero. In addition, let the non-negativity of the account balance be the
consistency constraint. The withdrawal transaction aborts if it encounters a balance less than $5.00. Depending
upon the order of execution, the withdrawal could be either successful or "bounced". We consider a schedule
z for these two transactions to be correct if under z each of the two transactions does what it is supposed to do,
independent of whether the withdrawal is successful or is "bounced". We consider the schedule to be
consistent if at the end of execution the account balance is non-negative.
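The account example can be replayed in a few lines of Python. This is only a sketch of the two serial orders; the function names and the tuple-based status reporting are assumptions made for illustration.

```python
def withdraw(balance, amount=5):
    """Withdraw `amount`; abort ("bounce") if the balance is less than `amount`."""
    if balance < amount:
        return balance, "bounced"
    return balance - amount, "ok"

def deposit(balance, amount=10):
    return balance + amount, "ok"

def run(schedule, balance=0):
    """Execute the transactions in order, recording what each one did."""
    outcomes = []
    for txn in schedule:
        balance, status = txn(balance)
        outcomes.append((txn.__name__, status))
    return balance, outcomes

b1, o1 = run([withdraw, deposit])   # withdrawal runs first and bounces
b2, o2 = run([deposit, withdraw])   # withdrawal runs second and succeeds

# Both orders are consistent: the balance never ends up negative.
consistent = b1 >= 0 and b2 >= 0
```

Both orders are also correct in the sense used above: each transaction does what it is specified to do for the balance it actually encounters, even though the final balances differ (10 in the first order, 5 in the second).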

This section is organized as follows. We first study transaction systems composed of single level
transactions. We define the notion of setwise serializable schedules and prove their consistency and correctness.
Next, we extend our work to nested transactions. We then introduce a new transaction syntax called a compound
transaction and define the associated schedules called generalized setwise serializable schedules. We conclude
this section by proving the consistency and correctness of generalized setwise serializable schedules. Finally,
we want to point out that throughout this section, transaction steps are classified into "read" and "write". The
possible use of a richer set of primitive steps will be discussed in Section 3.2.4.2.

3.3.3.1  Single Level Transactions

The  study  of  transaction  systems  composed  of  single  level  transactions  forms  the  basis  for  our  later  work  on 
nested  transactions  and  compound  transactions.  This  section  is  organized  as  follows.  First,  we  define  the 
syntax  of  single  transactions.  Second,  we  define  the  notions  of  schedules,  equivalent  schedules  and  setwise 
serializable  schedules.  Third,  we  define  the  notion  of  consistency  and  correctness  and  prove  that  setwise 
serializable  schedules  are  consistent  and  correct.  We  conclude  this  section  by  developing  algorithms  for  the 
enforcement  of  setwise  serializability. 

3.3.3.1.1  Syntax

In  this  section,  we  define  the  syntax  of  single  level  transactions  and  the  notions  of  pre-  and  post-conditions 
of  a  transaction.  We  first  define  the  syntax  of  single  level  transactions. 

Definition: A single level transaction $T_i$ is a sequence of transaction steps $(t_{i,1}, t_{i,2}, \ldots, t_{i,m})$ with the
following syntax:

<SingleLevelTransaction> ::= BeginTransaction <StepList> EndTransaction.

<StepList> ::= <Step> | <StepList>;<StepList>;

<Step> ::= ReadStep | WriteStep

A transaction step, either "read" or "write", is modelled as the non-divisible execution of the following
instructions [Kung 79a]:

$L_{i,j} \leftarrow O_{i,j}$;

$O_{i,j} \leftarrow f_{i,j}(L_{i,1}, L_{i,2}, \ldots, L_{i,j})$

where $t_{i,j}$ represents step j of transaction $T_i$, $L_{i,j}$ is the local variable used by $t_{i,j}$ to store the value read, $O_{i,j}$
is a data object accessed by $t_{i,j}$ and $f_{i,j}$ represents the computation. In this model, every step reads and then
writes a data object. A read step is interpreted as writing the value read back to the data object, i.e. the $f_{i,j}$
associated with a read step is the identity function.
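The step model can be sketched in Python as follows. The sketch assumes, following our reading of the [Kung 79a] model, that the function applied by a step may depend on every local value the transaction has read so far; the dictionary `db`, the helper `run_step` and the example transaction are hypothetical.

```python
def run_step(db, locals_, obj, f):
    """Execute one indivisible step on data object `obj`."""
    locals_.append(db[obj])    # read: the local variable of this step receives the object's value
    db[obj] = f(*locals_)      # write: the object receives f applied to the locals read so far

# A read step writes back the value it has just read (identity on the last local).
identity = lambda *ls: ls[-1]

db = {"A": 1, "B": 10}
locals_ = []
run_step(db, locals_, "A", identity)              # read A
run_step(db, locals_, "B", lambda a, b: a + b)    # write B := A + B
```

After the two steps the locals hold the values 1 and 10, A is unchanged and B holds 11.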


We  now  define  the  notions  of  input  steps,  output  steps,  precondition  and  post-condition  of  a  transaction. 

Definition: Let $T_i = \{t_{i,1}, \ldots, t_{i,m}\}$ be a transaction. Let the data object $O_{i,j}$ be the one accessed (read or
written) by step $t_{i,j}$. Step $t_{i,j}$ is said to be an input step if it is the step in $T_i$ that first accesses data object $O_{i,j}$. Step
$t_{i,j}$ is said to be an output step if it is the step in $T_i$ last accessing $O_{i,j}$. That is, for every data object O accessed by
$T_i$ there are an input step and an output step associated with O. Note that when there is only one step in $T_i$
accessing O, then this step is both an input and an output step.

Definition: Let $O_{m,j}[v_i]$, j = 1 to k, be the set of values read by the input steps of $T_m$, where $v_i$ denotes the
version of a data object that is input to a transaction. Let the index set of $O_{m,j}[v_i]$, j = 1 to k, be $I_m$. The input
values to $T_m$, $O_{m,j}[v_i]$, j = 1 to k, are said to satisfy the pre-condition of $T_m$ if and only if

$\exists (X \in U)\ (\pi_{I_m}(X) = (O_{m,1}[v_i], \ldots, O_{m,k}[v_i]))$.

That  is,  a  transaction  must  function  properly  if  all  the  values  input  to  the  transaction  could  have  come  from 
a  consistent  state  of  the  database. 


Definition: Let $O_{m,j}[v_f]$, j = 1 to k, be the set of values written by the output steps of transaction $T_m$, where
$v_f$ denotes the version of a data object output by the transaction. The post-condition of transaction $T_m$ is the
specification of the output values of $T_m$ as functions of the input values:

$O_{m,j}[v_f] = f_{m,j}(O_{m,1}[v_i], \ldots, O_{m,k}[v_i])$, j = 1 to k.


3.3.3.1.2  Schedules and Setwise Serializability

Given a consistency preserving partition of the database, the setwise serializable scheduling rule partitions
each transaction into a special form of atomic step segments called transaction ADS segments. A transaction
ADS segment of a transaction T is simply all the steps in T that access the same ADS. A schedule z is said to
be setwise serializable if all the transaction ADS segments in the system are executed serializably under z.
The purpose of this section is to formally define the notion of setwise serializability and to identify the
conditions under which a schedule is setwise serializable.

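Definition: A transaction ADS segment is the sequence of steps in a transaction that accesses the same
ADS. Let $\Psi(i, \mathcal{A})$ denote the transaction ADS segment of transaction $T_i$ accessing ADS $\mathcal{A}$. Let $t_{i,j} > t_{k,m}$
denote that step $t_{i,j}$ is executed after step $t_{k,m}$.

1. $\Psi(i, \mathcal{A}) = \{t \mid (t \in T_i) \wedge (t \text{ reads or writes a data object in ADS } \mathcal{A})\}$

2. $\forall ((t_{i,j}, t_{i,k} \in \Psi(i, \mathcal{A})) \wedge (t_{i,j} > t_{i,k}))\ ((t_{i,j}, t_{i,k} \in T_i) \wedge (t_{i,j} > t_{i,k}))$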
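Extracting the transaction ADS segments of a single transaction amounts to grouping its steps by the ADS they touch while keeping the transaction's own step order. A minimal Python sketch, in which the step encoding, the object names and the map `ads_of` from data objects to ADS names are all assumptions:

```python
def ads_segments(transaction, ads_of):
    """Group the steps of one transaction by the ADS they access, preserving step order."""
    segments = {}
    for step in transaction:                 # step = (step_name, data_object)
        ads = ads_of[step[1]]
        segments.setdefault(ads, []).append(step)
    return segments

# Assumed layout: objects x and w belong to ADS "A1", objects y and u to ADS "A2".
ads_of = {"x": "A1", "w": "A1", "y": "A2", "u": "A2"}

# A four-step transaction whose 1st and 3rd steps access A1 and whose 2nd and 4th access A2.
T = [("t1", "x"), ("t2", "y"), ("t3", "w"), ("t4", "u")]
segs = ads_segments(T, ads_of)
```

The two resulting segments are {t1, t3} and {t2, t4}; the ordering between steps of different segments is deliberately not recorded.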

We  now  define  the  notions  of  transaction  systems  and  their  schedules.  Next,  we  define  the  notion  of  a 
setwise  serial  schedule. 

Definition: A transaction system T is a finite set of transactions $\{T_1, \ldots, T_n\}$ operating upon the shared
database D.

Definition: A schedule z for transaction system T is a totally ordered set of all the steps in the transaction
system $T = \{T_1, \ldots, T_n\}$ such that the ordering of steps of $T_i$, i = 1 to n, in the schedule is consistent with the
ordering of steps in the transaction $T_i$, i = 1 to n.

$[\forall (t)\ ((t \in z) \leftrightarrow (t \in T))] \wedge [\forall (T_i \in T)\ \forall ((t_{i,j}, t_{i,k} \in T_i) \wedge (t_{i,k} > t_{i,j}))\ ((t_{i,j}, t_{i,k} \in z) \wedge (t_{i,k} > t_{i,j}))]$

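Definition: A setwise serial schedule is a schedule in which transaction ADS segments accessing the same
ADS do not overlap. Let $t_{i,\mathcal{A},1}$ and $t_{i,\mathcal{A},m}$ denote the first step and the last step of transaction ADS segment
$\Psi(i, \mathcal{A})$ respectively. Let Q be a consistency preserving partition of D. A schedule z for transaction system T is
said to be setwise serial if and only if under z,

$\forall (\mathcal{A} \in Q)\ \forall (T_i \in T)\ \forall ((t_{\mathcal{A}} \in z) \wedge (t_{\mathcal{A}} \notin \Psi(i, \mathcal{A})))\ ((t_{\mathcal{A}} < t_{i,\mathcal{A},1}) \vee (t_{\mathcal{A}} > t_{i,\mathcal{A},m}))$

where $t_{\mathcal{A}}$ represents any step accessing ADS $\mathcal{A}$ in the transaction system T.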
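Whether a given schedule is setwise serial can be tested by looking, for each ADS separately, at the order in which transactions touch it: each transaction's steps on that ADS must form one uninterrupted run. The Python sketch below assumes a schedule encoded as (transaction, data object) pairs and a hypothetical map `ads_of` from objects to ADS names.

```python
from itertools import groupby

def is_setwise_serial(schedule, ads_of):
    """schedule: list of (transaction_id, data_object) pairs in execution order."""
    by_ads = {}
    for txn, obj in schedule:
        by_ads.setdefault(ads_of[obj], []).append(txn)
    for txns in by_ads.values():
        runs = [t for t, _ in groupby(txns)]   # collapse consecutive repeats
        if len(runs) != len(set(runs)):        # some transaction's segment was interrupted
            return False
    return True

ads_of = {"x": "A1", "w": "A1", "y": "A2", "u": "A2"}

# T1 precedes T2 on A1, but T2 precedes T1 on A2: setwise serial, yet not a serial schedule.
s1 = [("T1", "x"), ("T1", "w"), ("T2", "y"), ("T2", "u"), ("T2", "x"), ("T1", "y")]

# T2 interrupts T1's segment on A1.
s2 = [("T1", "x"), ("T2", "x"), ("T1", "w")]

r1 = is_setwise_serial(s1, ads_of)
r2 = is_setwise_serial(s2, ads_of)
```

Schedule `s1` shows why setwise serial schedules are a strictly larger class than serial schedules: the two transactions are ordered differently on the two atomic data sets.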

Having defined the notion of setwise serial schedules, we want to define a setwise serializable schedule as
one which is computationally equivalent to a setwise serial schedule. This requires us to first define the
notion of equivalent schedules.

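Definition: Let $O_{i,j} \equiv O_{k,m}$ denote that step $t_{i,j}$ and step $t_{k,m}$ read or write the same data object. A schedule z
for transaction system T is said to be equivalent to another schedule z* for T, if for every pair of steps $t_{i,j}$ and
$t_{k,m}$ in z and z*,

$\forall ((t_{i,j}, t_{k,m} \in T) \wedge (O_{i,j} \equiv O_{k,m}))\ ((t_{i,j} > t_{k,m} \text{ in } z) \leftrightarrow (t_{i,j} > t_{k,m} \text{ in } z^*))$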

That is, the partial orderings of steps on each of the shared data objects in z and z* are identical. In
addition, the orderings of steps in both z and z* are also consistent with the internal orderings of steps in each
of the transactions in T, since z and z* are schedules for the same transaction system T. Hence, for any given
initial state of D the executions of T according to z and z* give the same computation in the sense that they
yield the same sequences of values for each data object in the database and the same sequences of values for
each of the local variables (states) of each transaction in T.

Definition:  A  schedule  z  for  transaction  system  T  is  said  to  be  setwise  serializable  if  there  exists  a  setwise 
serial  schedule  z*  for  T  such  that  z  and  z*  are  equivalent. 

Setwise  serializability  is  determined  by  the  transaction  ADS  segment  precedence  graph.  In  the  following, 
we  first  define  the  notion  of  a  general  precedence  graph. 

Definition: Let z be a schedule for transaction system T. Let $\tau$ be a sequence of transaction steps in T, and let
$\Gamma$ be a finite set of such sequences. Let $\Sigma$ be a set of data objects in D. A precedence graph $G(z, \Gamma, \Sigma)$ is a
directed graph whose nodes are elements of $\Gamma$. An arc $\langle \tau_i, \tau_j \rangle$, which represents that $\tau_i$ precedes $\tau_j$, exists if
the execution of T according to z results in one of the following three conditions:

1. there exists a data object $O \in \Sigma$ for which $\tau_i$ reads from O immediately before $\tau_j$ writes into O;

2. there exists a data object $O \in \Sigma$ for which $\tau_i$ writes into O immediately before $\tau_j$ reads from O;

3. there exists a data object $O \in \Sigma$ for which $\tau_i$ writes into O immediately before $\tau_j$ writes into O.

As  an  example,  the  familiar  transaction  system  precedence  graph  for  schedule  z  is  represented  by  G(z,  T, 
D),  where  z  is  a  schedule  for  transaction  system  T,  and  D  is  the  database. 

Definition: Let $\Sigma_{ADS}(T_i)$ be the partition of the steps of transaction $T_i$ such that each element of the
partition is a transaction ADS segment. Let $\Gamma$ be the set of all the transaction ADS segments in transaction
system T, i.e. $\Gamma = \{\Psi \mid \Psi \in \Sigma_{ADS}(T_i) \wedge T_i \in T\}$. The precedence graph $G(z, \Gamma, D)$ is called the transaction
ADS segment precedence graph for schedule z.


Let "Cycle(G) = 0" denote that the graph G contains no cycle.

We now prove that a schedule z for a transaction system T is setwise serializable if the transaction ADS
segment precedence graph for z contains no cycle. We must show that there always exists a setwise serial
schedule z* in which the partial ordering of steps on each of the data objects is the same as that in z. To
demonstrate this, we use the procedure known as topological sorting [Aho 83]. Topological sorting creates a
total ordering that is consistent with all the partial orderings represented by a directed acyclic graph. We first
use this procedure on the transaction ADS segment graph to create a list of partial setwise serial schedules,
each of which is a serial schedule for all the transaction ADS segments accessing the same ADS. Note that the
transaction ADS segment graph does not consider the step orderings between different transaction ADS
segments defined by the individual transactions. For example, a transaction $T_i$ with four steps can have its 1st
and 3rd steps accessing ADS $\mathcal{A}_1$ while the 2nd and 4th steps access ADS $\mathcal{A}_2$. That is, $\{t_{i,1}, t_{i,3}\}$ and $\{t_{i,2}, t_{i,4}\}$
are the two transaction ADS segments of $T_i$. The precedence relations from step 1 to 2, 2 to 3 and 3 to 4 are not
considered by the transaction ADS segment graph. To create the setwise serial schedule, we must take these
internal step orderings into account. Therefore, we now create a transaction system step precedence graph.
Each node in the graph is a step in T. We first draw arcs to represent all internal orderings between steps in
each of the transactions. An arc is drawn from node i to node j if step i immediately precedes step j in the
same transaction. Next, we draw arcs to represent the partial orderings that are defined by the partial setwise
serial schedules. An arc is drawn from node k to node m if step k immediately precedes step m in one of the
partial setwise serial schedules. Once this is done, we use the topological sorting procedure to create a total
ordering which then gives the required setwise serial schedule.

Theorem 3: A schedule z for transaction system T is setwise serializable if there is no cycle in the transaction
ADS segment precedence graph for z, i.e. $\mathrm{Cycle}(G(z, \Gamma, D)) = 0$.

Proof: We first use the topological sorting procedure on the transaction ADS segment graph to create a list
of partial setwise serial schedules. There must be a node in the transaction ADS segment precedence graph
for z that has no entering arcs. Otherwise, there is a cycle in the graph. Suppose that this node corresponds to
$\Psi(i, \mathcal{A})$. List transaction ADS segment $\Psi(i, \mathcal{A})$ on the partial setwise serial schedule $z_{\mathcal{A}}$ for ADS $\mathcal{A}$. Remove
$\Psi(i, \mathcal{A})$ from $G(z, \Gamma, D)$ and repeat the procedure until all the nodes are removed from $G(z, \Gamma, D)$. We now
create the transaction system step precedence graph in which each node represents a step in T. We draw an
arc from node i to node j if step i immediately precedes step j in the same transaction. We also draw an arc
from node k to node m if step k immediately precedes step m in one of the partial setwise serial schedules.
Having completed the graph, we perform the topological sorting procedure on the graph. The resulting total
ordering of steps is a setwise serial schedule z*. The total ordering of steps in z* is consistent with the internal
orderings of steps defined by the transactions in T and is consistent with the partial orderings of steps on each
of the data objects in $G(z, \Gamma, D)$. □
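The acyclicity test of Theorem 3 can be tried out on a small schedule. The Python sketch below builds a precedence graph and uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to detect cycles. Two simplifications are assumed: steps are encoded as (transaction, object, kind) triples with kind "r" or "w", and an arc is added for every pair of conflicting steps on the same object, which is the usual conflict-graph convention rather than the "immediately before" wording of the definition. The object-to-ADS map `ads_of` is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def precedence_graph(schedule, node_of):
    """Return {node: set of predecessor nodes} for the given schedule.

    `node_of` maps a step to its graph node (a transaction or a transaction ADS segment).
    """
    preds = {node_of(s): set() for s in schedule}
    for k, later in enumerate(schedule):
        for earlier in schedule[:k]:
            same_object = earlier[1] == later[1]
            conflict = "w" in (earlier[2], later[2])   # read/read pairs do not conflict
            if same_object and conflict and node_of(earlier) != node_of(later):
                preds[node_of(later)].add(node_of(earlier))
    return preds

def acyclic(preds):
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False

ads_of = {"x": "A1", "y": "A2"}
schedule = [("T1", "x", "w"), ("T2", "y", "w"), ("T2", "x", "w"), ("T1", "y", "w")]

# Nodes are whole transactions: T1 -> T2 on x and T2 -> T1 on y form a cycle.
txn_graph = precedence_graph(schedule, lambda s: s[0])

# Nodes are transaction ADS segments: the two arcs lie in different atomic data sets.
seg_graph = precedence_graph(schedule, lambda s: (s[0], ads_of[s[1]]))

serializable = acyclic(txn_graph)
setwise_serializable = acyclic(seg_graph)
```

For this schedule the transaction-level graph has a cycle while the segment-level graph does not, so the schedule is accepted by the setwise criterion although it is not serializable.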




It is worthwhile to point out that setwise serializable schedules do not generally prohibit cycles from being
formed in the transaction system precedence graph; they only prohibit cycles from being formed in the
transaction ADS segment precedence graph. Setwise serializability reduces to serializability if the database
consists of one ADS.

3.3.3.1.3  Consistency and Correctness

When serializability is used as the criterion of correctness for concurrency control, the notions of data
consistency and transaction correctness follow directly from the assumptions that each transaction terminates,
preserves the consistency of the database and produces correct results when executing alone. When schedules
are non-serializable, transactions can no longer be regarded as if they are executing alone. Therefore, we must
prove the consistency and correctness of any non-serializable schedule. We consider a schedule to be
consistent and correct if the execution of the transaction steps according to the schedule preserves the consistency
of the database and satisfies the post-condition of each of the transactions. Our fundamental assumptions
about a transaction are as follows:

•  A1 Termination: A transaction is assumed to terminate.

•  A2 Transaction Correctness: A transaction is assumed to produce results that satisfy its
post-condition when executing alone and when the database is initially consistent.

•  A3 Data Consistency: A transaction is assumed to preserve the consistency of the database when
executing alone.

Definition: A transaction T is said to be consistent and correct if and only if T satisfies assumptions A1, A2
and A3. A transaction system T is said to be consistent and correct if and only if all the transactions in T are
consistent and correct.

Definition: A schedule z for transaction system T is said to be consistent if and only if the execution of T
according to z preserves the consistency of the database D.

Definition:  A  schedule  z  for  transaction  system  T  is  said  to  be  correct  if  and  only  if  the  execution  of  T 
according  to  z  satisfies  the  post-condition  of  each  of  the  transactions  in  T. 

Before  proving  that  setwise  serializable  schedules  are  both  consistent  and  correct,  we  need  to  define  the 
notion  of  equivalent  executions  of  a  given  transaction  under  different  schedules. 

Definition: Let $z$ and $z^*$ be two schedules for transaction system $\mathbf{T}$. Let $t$ be a step of
transaction $T_i$ in $\mathbf{T}$. Let the values of the data object accessed by $t$ in $z$ and $z^*$ be $O$ and
$O^*$ respectively. Let the values of the local variable associated with $t$ in $z$ and $z^*$ be $L$ and $L^*$
respectively. Transaction $T_i$ is said to be executed equivalently under $z$ and $z^*$ if and only if

$$\forall (t \in T_i)\ \big((O = O^*) \wedge (L = L^*)\big)$$


Theorem 4: If $T_i$ is executed equivalently under two different schedules $z$ and $z^*$, then the
post-condition of $T_i$ will either be satisfied under both $z$ and $z^*$ or not satisfied under both $z$ and $z^*$.

Proof: Let the values input to $T_i$ be $O_1[v_1], \ldots, O_k[v_k]$. Let the values output by $T_i$ be
$O_1[v'_1], \ldots, O_k[v'_k]$. The post-condition of $T_i$ is the specification of the output values of $T_i$
as some functions of the input values:

$$O_j[v'_j] = f_j(O_1[v_1], \ldots, O_k[v_k]), \quad j = 1 \text{ to } k.$$

It follows from the definition of equivalent executions that all the input values to and the output values
from $T_i$ under $z$ and $z^*$ are identical. Therefore, the post-condition of $T_i$ will be either satisfied
under both $z$ and $z^*$ or not satisfied under both $z$ and $z^*$. □

We now prove that setwise serial schedules are consistent and correct. The proof is organized into three
Lemmas. Let $T_i$ be a consistent and correct transaction. In Lemma 5-1, we prove that $T_i$ preserves the
consistency of each of the accessed atomic data sets and produces correct results when executing alone. This
result is valid even if the database as a whole is inconsistent. In Lemma 5-2, we further prove that at the end
of executing a transaction ADS segment $\Psi(i, \mathcal{A}_j)$ of $T_i$, the consistency of $\mathcal{A}_j$ has
already been preserved. In addition, the output values of data objects in $\mathcal{A}_j$ are correct at the end
of $\Psi(i, \mathcal{A}_j)$. We need not wait for the end of $T_i$ to know these results. In Lemma 5-3, we relax
the executing alone condition. We show that the results of Lemma 5-2 are still valid for any ADS $\mathcal{A}_j$,
as long as $\mathcal{A}_j$ is consistent at the beginning of transaction segment $\Psi(i, \mathcal{A}_j)$.

Let $Q = \{\mathcal{A}_1, \ldots, \mathcal{A}_k\}$ be a given CP partition of $D$.

Definition: An ADS $\mathcal{A}_j$ is said to be accessed by a transaction if this transaction reads or writes
one or more data objects in $\mathcal{A}_j$.

Lemma 5-1: Let $T_i$ be a consistent and correct transaction. If $T_i$ executes alone and if the states of the
atomic data sets accessed by $T_i$ are initially consistent, then at the end of $T_i$ the state of each of the
accessed atomic data sets is consistent, and the values output by $T_i$ satisfy the post-condition of $T_i$.

Proof: Let $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, be the atomic data sets accessed by transaction $T_i$. Let
$Y \in \Omega$ be a state of $D$ such that $C_{i_j}(\pi_{S_{i_j}}(Y)) = 1$, $j = 1$ to $k_i$, where $C_{i_j}$
represents the ADS consistency constraints of $\mathcal{A}_{i_j}$ and $S_{i_j}$ represents the index set of
$\mathcal{A}_{i_j}$. Now let $X$ be a consistent state of the database such that
$\pi_{S_{i_j}}(X) = \pi_{S_{i_j}}(Y)$, $j = 1$ to $k_i$. Next, we let $T_i$ execute alone with the database
initially in state $X$. We now prove that with either $X$ or $Y$ as initial state, the executions of $T_i$ are
equivalent.

By assumptions A1-A3, with $X$ as initial state, $T_i$ produces correct results and preserves the consistency of
the database. It follows from Theorem 1 that $T_i$ preserves the consistency of all the atomic data sets. Let the
values of the local variable and the data object in the execution with initial state $X$ be $L_{t_{i,j}}$ and
$O_{t_{i,j}}$, and those with initial state $Y$ be $L^*_{t_{i,j}}$ and $O^*_{t_{i,j}}$. We must prove that
$L_{t_{i,j}} = L^*_{t_{i,j}}$ and $O_{t_{i,j}} = O^*_{t_{i,j}}$ at each step of the transaction. Recall that the
syntax of a transaction step is as follows.

$$L_{t_{i,j}} := O_{t_{i,j}}$$
$$O_{t_{i,j}} := f_{t_{i,j}}(L_{t_{i,1}}, \ldots, L_{t_{i,j}})$$

Since the initial states of all accessed ADS are equal with either $X$ or $Y$ as the initial state, it follows
that the initial values of the accessed data objects are equal. Hence, at the first step of $T_i$,
$L_{t_{i,1}} = L^*_{t_{i,1}}$. In addition,
$O_{t_{i,1}} = f_{t_{i,1}}(L_{t_{i,1}}) = f_{t_{i,1}}(L^*_{t_{i,1}}) = O^*_{t_{i,1}}$. Next,
$L_{t_{i,2}} = L^*_{t_{i,2}}$, because step two either reads the initial value of a data object or the value of
the data object output by step 1. Similarly, $O_{t_{i,2}} = O^*_{t_{i,2}}$.

Now suppose that these local variable and data object value pairs are equal from steps 1 to $r$. That is,
$L_{t_{i,h}} = L^*_{t_{i,h}}$ and $O_{t_{i,h}} = O^*_{t_{i,h}}$, $h = 1$ to $r$. We show that
$L_{t_{i,r+1}} = L^*_{t_{i,r+1}}$. This follows because step $r+1$ either reads the initial value of a data
object or a data object which has been output by some step between 1 and $r$. It follows that
$O_{t_{i,r+1}} = O^*_{t_{i,r+1}}$. By induction, the final values of accessed data objects with either $X$ or $Y$
as initial state are equal. Since the values of data objects in $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, not
accessed by $T_i$ remain unchanged, they must be equal at the end of the transaction with either $X$ or $Y$ as
the initial state. Let $W_X$ and $W_Y$ be the state of $D$ at the end of executing $T_i$ with $X$ and $Y$ as
initial states respectively. We have $\pi_{S_{i_j}}(W_X) = \pi_{S_{i_j}}(W_Y)$, $j = 1$ to $k_i$. The execution
using $X$ as the initial state is assumed to preserve the consistency of each accessed atomic data set; so must
the execution using $Y$ as the initial state. Since the two executions of $T_i$ are equivalent and the execution
with $X$ as initial state produces correct results, so must the execution using $Y$ as initial state. □


There are two implications from this Lemma. First, after the decomposition of the database, a programmer
is only required to know and maintain the consistency constraints of the atomic data sets accessed by his
transaction. Without a CP partition, everyone must, in principle, know all the database consistency constraints
in order to verify that one's transaction will not violate any of them. Second, this Lemma implies that a
transaction can still function properly as long as the accessed atomic data sets are consistent, even if the rest of
the atomic data sets are inconsistent. This is useful in recovery management, although this topic is outside the
scope of this paper.
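
The second implication can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and does not come from the report: the two atomic data sets `A1` and `A2`, their constraints, and the `transfer` transaction are hypothetical. The sketch runs the same transaction from a fully consistent state `X` and from a state `Y` that agrees with `X` on `A1` but violates the constraint of `A2`, and compares the results on `A1`.

```python
# Hypothetical two-ADS database; object names and constraints are illustrative only.
ADS = {
    "A1": {"objects": ("a", "b"), "constraint": lambda s: s["a"] + s["b"] == 100},
    "A2": {"objects": ("c", "d"), "constraint": lambda s: s["c"] == s["d"]},
}


def project(state, ads_name):
    """Restrict a database state to the data objects of one ADS."""
    return {name: state[name] for name in ADS[ads_name]["objects"]}


def is_consistent(state, ads_name):
    """Evaluate the consistency constraint of one ADS on a database state."""
    return ADS[ads_name]["constraint"](state)


def transfer(state, amount=10):
    """A transaction that accesses only A1: move `amount` from a to b."""
    result = dict(state)
    result["a"] -= amount
    result["b"] += amount
    return result


X = {"a": 60, "b": 40, "c": 7, "d": 7}  # every ADS consistent
Y = {"a": 60, "b": 40, "c": 7, "d": 9}  # A2 inconsistent, A1 identical to X

W_X, W_Y = transfer(X), transfer(Y)
same_on_a1 = project(W_X, "A1") == project(W_Y, "A1")
a1_still_consistent = is_consistent(W_Y, "A1")
```

In this toy case the two executions agree on `A1` and leave it consistent, while `A2` keeps whatever state it had, which is the situation the Lemma describes.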


Lemma 5-2: Let the atomic data sets accessed by transaction $T_i$ be $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$. If
$\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, are initially consistent and if $T_i$ executes alone, then at the end of
transaction ADS segment $\Psi(i, \mathcal{A}_{i_j})$, the consistency of $\mathcal{A}_{i_j}$ is preserved.
Furthermore, the values of data objects in $\mathcal{A}_{i_j}$ output by $\Psi(i, \mathcal{A}_{i_j})$ are correct
at the end of $\Psi(i, \mathcal{A}_{i_j})$.

Proof: At the end of the transaction ADS segment $\Psi(i, \mathcal{A}_{i_j})$, $j = 1$ to $k_i$, the data objects
in $\mathcal{A}_{i_j}$ are neither read nor written again. It follows that the values of the data objects in
$\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, are the same as at the end of the transaction. By Lemma 5-1, at the end of
the transaction, the consistency of each of the atomic data sets is preserved and the values of the data objects
output by $T_i$ are correct; the same must therefore hold at the end of each of the transaction ADS segments. □

Lemma 5-3: In a setwise serial schedule, if at the beginning of a transaction ADS segment
$\Psi(i, \mathcal{A}_{i_j})$, $j = 1$ to $k_i$, ADS $\mathcal{A}_{i_j}$ is consistent, then $\mathcal{A}_{i_j}$ is
consistent at the end of $\Psi(i, \mathcal{A}_{i_j})$, and the values of each of the data objects in
$\mathcal{A}_{i_j}$ output by $T_i$ are correct.

Proof: Let the atomic data sets accessed by $T_i$ be $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$. Now let $T_i$ execute
alone in a serial schedule $z^*$ with the initial states of $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, being identical
to the initial states of $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, in the setwise serial schedule $z$.

Let the values of the local variables and data objects in the serial schedule $z^*$ be $L^*_{t_{i,j}}$ and
$O^*_{t_{i,j}}$, and those in the setwise serial schedule $z$ be $L_{t_{i,j}}$ and $O_{t_{i,j}}$. We now prove that
the executions of $T_i$ under $z$ and $z^*$ are equivalent. Recall that the syntax of a transaction step is given by

$$L_{t_{i,j}} := O_{t_{i,j}}$$
$$O_{t_{i,j}} := f_{t_{i,j}}(L_{t_{i,1}}, \ldots, L_{t_{i,j}})$$

Since the initial states of ADS $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, are equal in both schedules, the initial
values of all the data objects in $\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, are equal. Therefore, the first steps in
both schedules input the same value, i.e. $L_{t_{i,1}} = L^*_{t_{i,1}}$. In addition,
$O_{t_{i,1}} = f_{t_{i,1}}(L_{t_{i,1}}) = f_{t_{i,1}}(L^*_{t_{i,1}}) = O^*_{t_{i,1}}$. Next,
$L_{t_{i,2}} = L^*_{t_{i,2}}$, because step two either reads the initial value of a data object or the value of
the data object output by step 1. Similarly, $O_{t_{i,2}} = O^*_{t_{i,2}}$.

Now suppose that these local variable and data object value pairs are equal from steps 1 to $r$. That is,
$L_{t_{i,h}} = L^*_{t_{i,h}}$ and $O_{t_{i,h}} = O^*_{t_{i,h}}$, $h = 1$ to $r$. We show that
$L_{t_{i,r+1}} = L^*_{t_{i,r+1}}$. This follows because step $r+1$ either reads the initial value of a data
object or a data object which has been output by some step between 1 and $r$. It follows that
$O_{t_{i,r+1}} = O^*_{t_{i,r+1}}$. Therefore, the final values of accessed data objects in both schedules are
equal at the end of each transaction ADS segment. In addition, data objects in $\mathcal{A}_{i_j}$, $j = 1$ to
$k_i$, not accessed by $T_i$ remain unchanged and are therefore equal at the end of each transaction ADS segment
for both schedules. It follows from Lemma 5-2 that at the end of $\Psi(i, \mathcal{A}_{i_j})$, the
$\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, are consistent, and the value of each of the data objects in
$\mathcal{A}_{i_j}$, $j = 1$ to $k_i$, output by $T_i$ is correct. □


Theorem 5: A setwise serial schedule is consistent and correct.

Proof: Let $Q = \{\mathcal{A}_1, \ldots, \mathcal{A}_k\}$ be a CP partition of $D$. Let the initial states of each
of the ADS's be $Z_{\mathcal{A}_j}[0]$, $j = 1$ to $k$. These initial states are assumed to be consistent.

Since a schedule is a totally ordered set of steps from all the transactions, each of which terminates, there
must exist a transaction ADS segment $\Psi(i, \mathcal{A}_j)$ which first finishes its computation. Let the
associated ADS state be $Z_{\mathcal{A}_j}[1]$. Since there are no interleavings among transaction ADS segments
accessing the same ADS in a setwise serial schedule, $Z_{\mathcal{A}_j}[1]$ must be output by a transaction which
has used only the initial states that were assumed to be consistent. By Lemma 5-3, $Z_{\mathcal{A}_j}[1]$ is
consistent, and the values of data objects in $\mathcal{A}_j$ output by $\Psi(i, \mathcal{A}_j)$ are correct.
Consider now the output of the second transaction ADS segment produced by the schedule. Since it can use only
$Z_{\mathcal{A}_j}[1]$ or $Z_{\mathcal{A}_m}[0]$, $m = 1$ to $k$ and $m \neq j$, at the end of this second
transaction ADS segment, the accessed atomic data set is in a consistent state and the output values are correct
by Lemma 5-3. Now assume that the first $n$ transaction ADS segments produce consistent and correct results. The
$(n+1)$st must also, by the same argument. By induction, the ADS state produced by each of the transaction ADS
segments is consistent, and the values of the data objects output in each ADS at the end of the transaction
ADS segment satisfy the post-condition. It follows that a setwise serial schedule is consistent and correct. □

Corollary 5: Setwise serializable schedules are consistent and correct.

Throughout this section, the choice of atomic data sets has been arbitrary. This is because the theorems
apply to any CP partition, whether maximal or not. If the CP partition consists of a single ADS, then setwise
serializable schedules reduce to serializable schedules.
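
The effect of the partition on what counts as an acceptable schedule can be sketched in Python. This is a minimal illustration under stated assumptions, not the report's algorithm: it assumes that the ADS precedence graph is the usual conflict-based precedence graph restricted to the data objects of one ADS, and the function names, the schedule encoding, and the example objects `x` and `y` are hypothetical.

```python
from collections import defaultdict


def ads_precedence_graphs(schedule, ads_of):
    """Build one precedence graph per ADS (assumed conflict-based definition).

    schedule: list of (txn, op, obj) with op in {"r", "w"}, in execution order.
    ads_of:   dict mapping each data object to the name of its ADS.
    An edge Ti -> Tj is added to the graph of an ADS when a step of Ti precedes
    a conflicting step of Tj on the same data object of that ADS.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for i, (t1, op1, obj1) in enumerate(schedule):
        for t2, op2, obj2 in schedule[i + 1:]:
            if t1 != t2 and obj1 == obj2 and "w" in (op1, op2):
                graphs[ads_of[obj1]][t1].add(t2)
    return graphs


def has_cycle(graph):
    """Depth-first search for a cycle in a dict-of-sets directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))


def is_setwise_serializable(schedule, ads_of):
    """True when every per-ADS precedence graph is acyclic."""
    graphs = ads_precedence_graphs(schedule, ads_of)
    return not any(has_cycle(g) for g in graphs.values())


schedule = [("T1", "w", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "w", "y")]
two_sets = is_setwise_serializable(schedule, {"x": "A", "y": "B"})
one_set = is_setwise_serializable(schedule, {"x": "D", "y": "D"})
```

With `x` and `y` in separate atomic data sets, each graph has a single edge and the schedule passes. With both objects in one ADS, the same schedule yields the cycle T1, T2, T1, so the test coincides with the ordinary serializability test.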

3.3.3.1.4 Algorithms for Maintaining Setwise Serializability
We have shown that if the database has been partitioned into consistency preserving atomic data sets, then a
setwise serializable schedule is consistent and correct. To enforce setwise serializability, we only need to
slightly modify the algorithms developed for the serializability theory. For example, we can modify the two
phase lock protocol [Eswaren 76] as follows.

Definition:  A  setwise  two  phase  lock  protocol  requires  a  transaction  not  to  release  any  lock  on  any  data 
object  of  an  atomic  data  set  until  all  the  locks  in  this  atomic  data  set  have  been  acquired.  Once  any  lock  in  an 
atomic  data  set  has  been  released,  no  more  data  objects  in  the  same  atomic  data  set  can  be  locked. 

Theorem  6:  A  setwise  two  phase  lock  protocol  guarantees  setwise  serializability. 

Proof: Suppose that the claim is false. Then at least one of the ADS precedence graphs contains a cycle, such
as $T_1 > T_2 > \cdots > T_k > T_1$. This implies that a lock of $T_1$ follows an unlock of $T_1$. This
contradicts the assumption that the locking protocol is setwise two phase. □
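
The protocol can be checked on a single transaction's lock trace with a few lines of Python. This is a sketch under stated assumptions: the trace encoding, the function name `follows_setwise_2pl`, and the objects `x` and `y` are hypothetical and serve only to illustrate the definition.

```python
def follows_setwise_2pl(ops, ads_of):
    """Check one transaction's lock trace against the setwise two phase rule.

    ops:    list of ("lock" | "unlock", obj) in the order the transaction issues them.
    ads_of: dict mapping each data object to the name of its ADS.
    """
    shrinking = set()  # ADSs in which this transaction has already released a lock
    for action, obj in ops:
        ads = ads_of[obj]
        if action == "unlock":
            shrinking.add(ads)
        elif action == "lock" and ads in shrinking:
            return False  # a lock follows an unlock within the same ADS
    return True


trace = [("lock", "x"), ("unlock", "x"), ("lock", "y"), ("unlock", "y")]
setwise_ok = follows_setwise_2pl(trace, {"x": "A", "y": "B"})
plain_2pl_ok = follows_setwise_2pl(trace, {"x": "D", "y": "D"})
```

The trace releases `x` before locking `y`. That is acceptable when the two objects belong to different atomic data sets, but it violates the rule when the whole database is treated as one ADS, which is the ordinary two phase lock protocol.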

Finally, we want to comment on the possible structures of an atomic data set. It is not necessary that the
structures be a single level. An atomic data set, like a general purpose database, can have structure. For
example, an atomic data set can have a tree structure. In this case, the tree protocol [Silberschatz 80] can be
used to enforce setwise serializability. This protocol requires that,

•  except  for  the  first  item  locked,  no  item  can  be  locked  unless  a  lock  is  currently  held  on  its  parent. 

•  no  item  is  ever  locked  twice. 

Note that the first item need not be the root, and the locking need not be two phase.
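
The two rules can be expressed as a check over a lock trace. The following Python sketch is illustrative only: the trace encoding, the `parent` map, and the small tree with items `r`, `a`, `b`, `c` are assumptions, and the check covers just the two rules stated above.

```python
def follows_tree_protocol(ops, parent):
    """Check one transaction's lock trace against the two tree protocol rules.

    ops:    list of ("lock" | "unlock", item) in issue order.
    parent: dict mapping each item to its parent in the tree (root maps to None).
    """
    held, ever_locked = set(), set()
    for action, item in ops:
        if action == "lock":
            if item in ever_locked:
                return False  # no item is ever locked twice
            if ever_locked and parent.get(item) not in held:
                return False  # after the first item, the parent must be locked
            held.add(item)
            ever_locked.add(item)
        elif action == "unlock":
            held.discard(item)
    return True


# Hypothetical tree: r is the root, a and b are children of r, c is a child of a.
parent = {"r": None, "a": "r", "b": "r", "c": "a"}

starts_below_root = [("lock", "a"), ("lock", "c"), ("unlock", "a"), ("unlock", "c")]
not_two_phase = [("lock", "r"), ("lock", "a"), ("unlock", "r"), ("lock", "c")]
parent_released = [("lock", "a"), ("unlock", "a"), ("lock", "c")]
```

The first two traces satisfy the rules even though one starts below the root and the other locks `c` after unlocking `r`. The third trace is rejected because `c` is locked after its parent `a` has been released.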

3.3.3.2 Nested Transactions

In the early work on serializability theory, a transaction was modelled as a sequence of steps. However, it is
natural to write transactions in a nested form, in which sub-transactions can be executed in parallel and
invoked by higher level ones. Recently, serializable nested transactions have been studied by [Gray 81, Moss
81, Lynch 83b, Beeri 83]. From a concurrency control point of view, the new issue associated with serializable
nested transactions is how to provide an "executing alone" environment for a parallel program. This can be
illustrated by the "lock passing" problem among parallel sub-transactions. Suppose that a data object is
shared by several sub-transactions of the same transaction. A sub-transaction which first accesses this data
object must be able to pass the "lock" to other sub-transactions. If it releases the "lock" to other transactions,
the rest of the sub-transactions needing this object may face unpredictable modifications to this data object
caused by other transactions. The nested transaction, as a whole, can no longer be considered to be executing
alone.

Given a nested transaction, the setwise serializable scheduling rule partitions the steps of the transaction
into transaction ADS segments. Due to the nested structure, a transaction ADS segment could be distributed
over several sub-transactions that can be executed in parallel. We prove that the consistency of the database
will still be preserved and the post-condition of each of the transactions will still be satisfied as long as the
schedule for the transaction system is setwise serializable.

3.3.3.2.1 Syntax

We can visualize a nested transaction as being organized in the form of a tree. Nodes in the tree are
sub-transactions and leaves are steps. The execution of the transaction is defined by the partial order of the
tree.

Definition: A nested transaction, $T^N$, is a partially ordered set of steps with the following syntax:

<NestedTransaction> ::= BeginTransaction <NestedTransactionBody> EndTransaction.

<NestedTransactionBody> ::= BeginSerial <SubTransactionList> EndSerial |
                            BeginParallel <SubTransactionList> EndParallel

<SubTransactionList> ::= <SubTransaction> | <SubTransactionList><SubTransactionList>

<SubTransaction> ::= <Step> | <SubTransaction>;<SubTransaction>; | <NestedTransactionBody>

<Step> ::= ReadStep | WriteStep

A step, either "read" or "write", is modelled as an indivisible execution of the following two instructions.

$$L_{t_{i,j,k}} := O_{t_{i,j,k}}$$
$$O_{t_{i,j,k}} := f_{t_{i,j,k}}(L_{t_{i,j,k}}, \ldots)$$

where $t_{i,j,k}$ is the $k$th step at level $j$ of transaction $T_i^N$, $L_{t_{i,j,k}}$ is the local variable used
by step $t_{i,j,k}$, $O_{t_{i,j,k}}$ is the data object accessed by step $t_{i,j,k}$, and $f_{t_{i,j,k}}$ represents
the computation performed by step $t_{i,j,k}$. Note that step $t_{i,j,k}$ can use its own local variable and the
local variables associated with steps preceding it in the partial ordering of transaction steps. A "read" step is
interpreted as one which writes the original value back into the data object, i.e. $f_{t_{i,j,k}}$ is the identity
function.

3.3.3.2.2 Consistency and Correctness


Due  to  the  partial  ordering,  the  ordering  between  some  steps  in  the  transaction  is  unspecified.  The  results 
produced  by  any  total  ordering  that  is  consistent  with  the  partial  ordering  in  the  transaction  must  be  equally 
valid.  Otherwise,  one  should  specify  the  order. 


Definition: A single level transaction $T_i^S$ is a linearization of the nested transaction $T_i^N$ if and only
if $T_i^S$ has the same steps as $T_i^N$ and if the total ordering of steps in $T_i^S$ is consistent with the
partial ordering of steps in $T_i^N$. That is,

$$[\forall t\ ((t \in T_i^S) \Leftrightarrow (t \in T_i^N))] \wedge [\forall t_k, t_m\ (((t_k, t_m \in T_i^N) \wedge (t_m > t_k)) \Rightarrow ((t_k, t_m \in T_i^S) \wedge (t_m > t_k)))]$$


Definition: A nested transaction is said to be consistent and correct if and only if each of the linearizations
of the nested transaction, when executed alone, satisfies our three assumptions about a transaction: it
terminates (A1), produces correct results (A2), and preserves the consistency of the database (A3). In the
following, we limit our investigation only to consistent and correct nested transactions.
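
The set of linearizations of a small nested transaction can be enumerated directly. The Python sketch below is illustrative: the function name, the step names `t1` to `t4`, and the encoding of the partial order as a set of (earlier, later) pairs are assumptions. The example corresponds to a serial body in which `t1` is followed by `t2` and `t3` in parallel, followed by `t4`.

```python
def linearizations(steps, precedes):
    """Yield every total order of `steps` consistent with the partial order.

    precedes: set of (earlier, later) pairs taken from the nested structure.
    """
    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for step in sorted(remaining):
            # `step` may come next only if no unplaced step must precede it
            if not any((p, step) in precedes for p in remaining):
                yield from extend(prefix + [step], remaining - {step})

    yield from extend([], frozenset(steps))


steps = ["t1", "t2", "t3", "t4"]
precedes = {("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")}
orders = list(linearizations(steps, precedes))
```

Here `orders` contains the two total orders that differ only in the relative position of `t2` and `t3`. Under the definition above, the nested transaction is consistent and correct only if both satisfy A1, A2 and A3 when executed alone.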
We now define the notion of a nested transaction system and its schedules.


Definition: A nested transaction system $\mathbf{T}^N = \{T_1^N, \ldots, T_m^N\}$ is a finite set of nested
transactions operating upon the shared database $D$.

Definition: Let $\mathcal{T}_i^S$ be the set of all the linearizations of nested transaction
$T_i^N \in \mathbf{T}^N$. Let $\mathbf{T}^S$ be a linearized transaction system for $\mathbf{T}^N$. That is,
$\mathbf{T}^S = \{T_1^S, \ldots, T_m^S\}$ is a transaction system in which $T_i^S \in \mathcal{T}_i^S$, $i = 1$ to
$m$. Let $\mathcal{T}^S$ be the set of all the linearized transaction systems for $\mathbf{T}^N$. A schedule $z$
for a nested transaction system $\mathbf{T}^N$ is a schedule of a linearized transaction system
$\mathbf{T}^S \in \mathcal{T}^S$.

Theorem 7: Setwise serial schedules of a nested transaction system are consistent and correct.

Proof: By definition, each of the linearizations of a nested transaction, when executing alone and when the
database is initially consistent, terminates, preserves the consistency of the database and produces correct
results. It follows from Theorem 5 that a setwise serial schedule for a linearized nested transaction system is
consistent and correct. This is true for all the linearizations of the given nested transaction system. It follows
that setwise serial schedules for nested transaction systems are consistent and correct. □

Corollary 7: Setwise serializable schedules for nested transactions are consistent and correct.

To implement a setwise two phase lock for a nested transaction, the principle is to ensure that the setwise
two phase lock protocol is observed among transactions while permitting internal lock passing within a nested
transaction. This can be done by following Moss' lock passing method [Moss 81]. Each sub-transaction
follows the setwise two phase lock protocol. However, locks released by sub-transactions are retained by their
parent. These locks can be acquired by other sub-transactions under that parent, but not by other
transactions. After the parent releases any lock on an atomic data set, none of its children can acquire any new
lock on this atomic data set. A given level L in a nested transaction is said to be the top level for ADS
$\mathcal{A}$ if level L does not pass locks on $\mathcal{A}$ to higher levels and if the locks on $\mathcal{A}$
directly acquired at level L plus those retained from lower levels constitute the complete set of locks on ADS
$\mathcal{A}$. Data objects in an atomic data set can be unlocked only at the top level with respect to this
atomic data set.
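
The retention rule can be sketched as a toy lock table in Python. This is an assumption-laden illustration rather than Moss' algorithm as published: the class `NestedLocks`, its methods, and the transaction names are hypothetical, only exclusive locks are modelled, and the setwise two phase restriction on re-acquisition after a parent releases a lock is deliberately left out.

```python
class NestedLocks:
    """Toy lock table: owner[obj] is the (sub-)transaction holding or retaining obj."""

    def __init__(self, parent):
        self.parent = parent  # child name -> parent name (top level maps to None)
        self.owner = {}

    def _ancestors(self, txn):
        """Yield txn itself and then each of its ancestors up to the top level."""
        while txn is not None:
            yield txn
            txn = self.parent.get(txn)

    def acquire(self, txn, obj):
        holder = self.owner.get(obj)
        # Grant if the object is free or retained by txn or one of its ancestors.
        if holder is None or holder in self._ancestors(txn):
            self.owner[obj] = txn
            return True
        return False

    def release(self, txn, obj):
        if self.owner.get(obj) != txn:
            raise ValueError(f"{txn} does not hold {obj}")
        up = self.parent.get(txn)
        if up is None:
            del self.owner[obj]    # top level: the lock leaves the transaction
        else:
            self.owner[obj] = up   # otherwise the parent retains the lock


locks = NestedLocks({"T1": None, "T1.a": "T1", "T1.b": "T1", "T2": None})
log = [locks.acquire("T1.a", "K")]       # first sub-transaction locks K
log.append(locks.acquire("T1.b", "K"))   # sibling must wait while T1.a holds K
locks.release("T1.a", "K")               # K is now retained by T1
log.append(locks.acquire("T2", "K"))     # another transaction is refused
log.append(locks.acquire("T1.b", "K"))   # the sibling inherits the lock from T1
```

In this run `log` records that the sibling and the foreign transaction are both refused while the lock is inside `T1`, and that the sibling succeeds once the parent retains it.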

3.3.3.3 Compound Transactions

The setwise serializable scheduling rule does not use any semantic information to guide the partition of
individual transactions. It takes a transaction and partitions its steps into transaction ADS segments,
independent of the semantics of the transaction. To obtain a higher degree of concurrency, the semantic
information of one's own transaction must be utilized in the scheduling process. Generalized setwise serializable
scheduling rules are a family of modular scheduling rules designed for this purpose. These rules are
represented by the new transaction syntax called compound transactions. In other words, users of this family
of rules must carefully study their own transactions and try to express their transactions in the form of
compound transactions.


In a compound transaction, steps are partitioned into equivalence classes called elementary transactions, each
of which terminates, preserves the consistency of the database and produces results satisfying its own
post-condition. In a compound transaction, elementary transactions are partially ordered, and the conjunction of
the post-conditions of the elementary transactions must be equivalent to the post-condition of the compound
transaction. Once a transaction is expressed in the form of compound transactions, each of the elementary
transactions in a compound transaction can be further partitioned into transaction ADS segments. A schedule
z is said to be generalized setwise serializable if under z all the transaction ADS segments of all the elementary
transactions in the system are executed serializably. In this section, we define the notion of compound
transactions and prove that generalized setwise serializable schedules are consistent and correct.


Before the development of a formal model, we would like to illustrate the concepts with a simplified
example of resource management.

3.3.3.3.1 Consistency Preserving Partition of Transactions: An Example

Suppose that a distributed computer system consists of nodes having various resources. These resources are
described by counter variables which indicate the units of various resources available and lists which describe
the units loaned to various processes. For simplicity, we only consider a single type of resource at each node.
The counter variable and the list at each node form an atomic data set with a consistency constraint requiring
the sum of the units of the available resources and the loaned resources to be a constant. Let the counter
variable and the list at node i be $K_i$ and $L_i$ respectively. Consider a transaction, $T_1$, which attempts to
obtain one unit of resources at both nodes 1 and 2, or none at all. Without using the idea of compound
transaction, we code $T_1$ in the form of a nested transaction. To illustrate the locking protocol we write the
following pseudo-code in which sub-transactions are written redundantly and in line.

Nested Transaction T1

Data Objects: K1, K2, L1, L2;

BeginTransaction
  BeginSerial
    BeginParallel
      WriteLock K1;
      WriteLock K2;
    EndParallel;

    if not ((K1 > 0) and (K2 > 0)) then
      BeginParallel
        Unlock K1;
        Unlock K2;
      EndParallel
    else
      BeginParallel
        BeginSerial
          Sub-Transaction GetResource1
          BeginSerial
            K1 := K1 - 1;
            WriteLock L1;
            Update L1;
          EndSerial; {end of sub-transaction}
          Unlock L1;
          Unlock K1;
        EndSerial;

        BeginSerial
          Sub-Transaction GetResource2
          BeginSerial
            K2 := K2 - 1;
            WriteLock L2;
            Update L2;
          EndSerial; {end of sub-transaction}
          Unlock K2;
          Unlock L2;
        EndSerial;
      EndParallel;
  EndSerial;
EndTransaction.


This provides a higher degree of concurrency than that permitted by a serializable schedule because locks on
each atomic data set are released as soon as the operations on each set are done, even if the transaction has not
obtained all the locks. However, such an approach may not provide enough concurrency when the
communication delay among nodes is large and when the transaction tries to get resources from many different
nodes. This is because the transaction must obtain all the locks on $K_i$, i = 1 to n, before it can decide if it
can proceed. This could block the system resource allocation activity for a significant amount of time.

Fortunately, the degree of concurrency can be markedly increased by rewriting $T_1$ as a compound
transaction. For the purpose of illustrating the locking protocol, we write the following pseudo-code in which
elementary transactions are written redundantly and in line.

Compound Transaction T1

Data Objects: K1, K2, L1, L2;

Local Variables: ObtainResource1, ObtainResource2;

BeginTransaction
  BeginSerial
    BeginParallel
      Elementary Transaction GetResource1
      BeginSerial
        ObtainResource1 := false;
        WriteLock K1;
        if K1 > 0 then
          BeginSerial
            K1 := K1 - 1;
            ObtainResource1 := true;
            WriteLock L1;
            Update L1;
            Unlock L1;
          EndSerial;
        Unlock K1;
      EndSerial; {end of elementary transaction}

      Elementary Transaction GetResource2
      BeginSerial
        ObtainResource2 := false;
        WriteLock K2;
        if K2 > 0 then
          BeginSerial
            K2 := K2 - 1;
            ObtainResource2 := true;
            WriteLock L2;
            Update L2;
            Unlock L2;
          EndSerial;
        Unlock K2;
      EndSerial; {end of elementary transaction}
    EndParallel;

    BeginParallel
      Elementary Transaction ReturnResource1
      BeginSerial
        if (ObtainResource1) and not (ObtainResource2) then
          BeginSerial
            WriteLock K1;
            K1 := K1 + 1;
            WriteLock L1;
            Update L1;
            Unlock K1;
            Unlock L1;
          EndSerial;
      EndSerial; {end of elementary transaction}

      Elementary Transaction ReturnResource2
      BeginSerial
        if (ObtainResource2) and not (ObtainResource1) then
          BeginSerial
            WriteLock K2;
            K2 := K2 + 1;
            WriteLock L2;
            Update L2;
            Unlock K2;
            Unlock L2;
          EndSerial;
      EndSerial; {end of elementary transaction}
    EndParallel;
  EndSerial;
EndTransaction.

Note that in this example, each elementary transaction follows the setwise two phase lock protocol but the
compound transaction does not. Since the compound transaction violates the setwise two phase lock
protocol, we cannot use Corollary 5 to conclude that it will maintain the consistency of the database and
produce correct results. Possible inconsistency or incorrectness seemingly could arise, because certain data
objects used by a compound transaction could be modified by other transactions during the execution of the
compound transaction. For example, the unlocking and relocking of $K_1$ and $L_1$ permits other transactions to
assign any arbitrary but consistent values to $K_1$ and $L_1$ during the execution of the compound transaction
$T_1$. Nevertheless, the consistency and correctness of a compound transaction follows from our consistency and
correctness assumptions about elementary transactions as well as our new assumption that a compound
transaction produces correct results if each of its elementary transactions produces correct results. We
formalize these ideas as follows.
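
The behaviour described above can be simulated sequentially in Python. The sketch is a simplification under stated assumptions: locking is not modelled, each elementary transaction is a plain function that runs atomically, and the names `TOTAL`, `make_node`, `compound_t1` and the `between` hook (which stands for other transactions running while no lock on $K_1$ or $L_1$ is held) are hypothetical.

```python
TOTAL = 3  # hypothetical number of resource units per node


def make_node(free):
    """A node's ADS: counter K of free units and list L of loaned units."""
    return {"K": free, "L": ["other"] * (TOTAL - free)}


def consistent(node):
    """ADS constraint: free units plus loaned units is a constant."""
    return node["K"] + len(node["L"]) == TOTAL


def get_resource(node, pid):
    """Elementary transaction: take one unit if one is available."""
    if node["K"] > 0:
        node["K"] -= 1
        node["L"].append(pid)
        return True
    return False


def return_resource(node, pid):
    """Elementary transaction: give back the unit loaned to pid."""
    node["K"] += 1
    node["L"].remove(pid)


def compound_t1(node1, node2, pid="T1", between=lambda: None):
    """Obtain one unit at both nodes, or none at all."""
    got1 = get_resource(node1, pid)
    got2 = get_resource(node2, pid)
    between()  # other transactions may run here
    if got1 and not got2:
        return_resource(node1, pid)
    if got2 and not got1:
        return_resource(node2, pid)
    return got1 and got2


node1, node2 = make_node(2), make_node(0)
succeeded = compound_t1(node1, node2, between=lambda: get_resource(node1, "T2"))
```

In this run node 2 has no free unit, so `T1` gives back its unit at node 1 even though another transaction changed $K_1$ and $L_1$ in the meantime. Both nodes satisfy their ADS constraint after every elementary transaction.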

3.3.3.3.2 Syntax

We  can  visualize  a  compound  transaction  as  being  a  tree  with  the  nodes  being  sub-compound  transactions 
and  the  leaves  being  elementary  transactions.  Each  elementary  transaction  has  the  structure  of  a  nested 
transaction. 


Definition: A compound transaction is a partially ordered set of elementary transactions defined as follows.

<CompoundTransaction> ::= BeginTransaction <CompoundTransactionBody> EndTransaction.

<CompoundTransactionBody> ::= BeginSerial <SubCompoundTransactionList> EndSerial |
                              BeginParallel <SubCompoundTransactionList> EndParallel

<SubCompoundTransactionList> ::= <ElementaryTransaction> | <CompoundTransactionBody> |
                                 <SubCompoundTransactionList>;<SubCompoundTransactionList>

<ElementaryTransaction> ::= <NestedTransactionBody>.

3.3.3.3.3 Consistency and Correctness

Having defined the syntax of a compound transaction, we must consider a system of compound transactions
and determine the set of schedules which are consistent and correct.

Assumption: Each elementary transaction terminates (A1), satisfies its post-condition (A2) and preserves the
consistency of the database (A3) when executing alone and when the database is initially consistent.

Definition: The post-condition of a compound transaction is equivalent to the conjunction of the
post-conditions of its elementary transactions.

Definition: A compound transaction system $\mathbf{T}^c = \{T_1^c, T_2^c, \ldots, T_n^c\}$ is a finite set of
compound transactions operating on database $D$.

Definition:  A  schedule  z  of  a  compound  transaction  system  1*  is  a  totally  ordered  set  of  all  the  steps  in  T\ 
such  that  the  ordering  of  steps  of  each  compound  transaction  T®,  i  =  1  to  n,  in  the  schedule  is  consistent  with 
the  partial  ordering  of  these  steps  in  transaction  T®,  i  =  1  to  n. 

Let  >  t^  denote  that  step  t^  is  executed  after  ^  m. 

I  V(tX(t«*)  —  (tcT®)  )I  A  [Va^Vat.^  €  Tf)  A  (tu>ty))((ty,tu€l)  A  (tk>tu)  )I 

Definition: An elementary transaction system $T^e$ is said to be associated with the compound transaction
system $T^c$ if and only if,

$\forall (T_{i,j}^e)((T_{i,j}^e \in T^e) \leftrightarrow (T_i^c \in T^c))$

where $T_{i,j}^e$ is an elementary transaction of $T_i^c$.

Definition: A schedule of a compound transaction system is said to be generalized setwise serializable if and
only if the associated elementary transaction system is setwise serializable.

Theorem 8: Generalized setwise serializable schedules are consistent and correct.

Proof: Since the schedule is setwise serializable with respect to all the elementary transactions in the system,
it follows from Corollary 5 and the definition of elementary transactions that each elementary transaction
terminates, preserves the consistency of the database and produces results that satisfy its post-condition.
Hence, the consistency of the database is preserved. By definition, the post-condition of each of the
compound transactions is also satisfied. Hence, generalized setwise serializable schedules are consistent and
correct. □

Finally, it follows from the definition that generalized setwise serializable schedules can be implemented by
requiring each of the elementary transactions in the transaction system to follow the setwise two phase lock
protocol.


3.3.4  Modularity,  Application  Independence  and  Optimality 

In this section, we first formalize the important concepts of "modularity" and "application independence".
Having set up the theoretical framework, we prove that setwise serializable schedules are optimal in the set of
application independent schedules and that generalized setwise serializable schedules form a complete class in
the set of modular schedules.

3.3.4.1 Modularity and Application Independence

A  transaction  facility  consists  of  a  set  of  transactions  operating  upon  a  shared  database.  For  the  remainder 
of  this  section,  we  assume  that  in  the  design  phase  the  consistency  constraints  of  the  database  are  specified, 
and  the  resulting  consistency  preserving  partition  is  determined  and  remains  fixed.  Programmers  are  then 
required  to  write  transactions  for  various  applications  that  use  the  system  database  and  observe  the  database 
consistency constraints. Having written or modified his transaction, the programmer must schedule his
transaction according to some rule so that transactions can be executed concurrently, consistently and
correctly.

A  transaction  scheduling  rule  is  a  specification  of  the  permissible  interleaving  of  the  steps  of  a  given 
transaction  with  the  steps  of  other  transactions.  Given  a  transaction  system,  a  transaction  scheduling  rule 
partitions  the  steps  of  each  transaction  into  atomic  step  segments  that  will  be  executed  without  being 
interfered  with  by  steps  of  other  transactions.  For  example,  in  serializable  schedules,  all  the  steps  in  a  single 
transaction are grouped into a single atomic step segment. In setwise serializable schedules, each transaction
ADS segment (steps accessing the same ADS) is taken as an atomic step segment. In generalized setwise
serializable schedules, the transaction ADS segments in each of the elementary transactions are atomic step
segments. Once the partition of the steps in a transaction has been specified, one can use "locks",
"timestamps" or other protocols to ensure that steps from various transactions are interleaved in such a way that
each atomic step segment will be executed serializably.
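
The difference between the serializable and the setwise serializable partitions can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name ads_segments, the mapping ads_of, and the three-object database are hypothetical and do not appear in the report.

```python
from collections import OrderedDict


def ads_segments(steps, ads_of):
    """Group a transaction's steps into transaction ADS segments.

    steps  : list of (op, obj) pairs, where op is "r" or "w"
    ads_of : dict mapping each data object to the name of its atomic data set
    Returns an ordered mapping from ADS name to the indices of the steps
    that access that ADS.
    """
    segments = OrderedDict()
    for index, (_op, obj) in enumerate(steps):
        segments.setdefault(ads_of[obj], []).append(index)
    return segments


# Hypothetical database: A and B form one ADS, C forms another.
ads_of = {"A": "ads1", "B": "ads1", "C": "ads2"}
transaction = [("r", "A"), ("r", "C"), ("w", "B"), ("w", "C")]

# Setwise serializable rule: one atomic step segment per ADS accessed.
setwise = ads_segments(transaction, ads_of)

# Serializable rule: the whole transaction is a single atomic step segment.
serializable = {"all": list(range(len(transaction)))}
```

Here the setwise partition yields two atomic step segments (steps 0 and 2 on ads1, steps 1 and 3 on ads2), whereas the serializable partition keeps all four steps together.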

In the transaction system semantic information approach, one is allowed to utilize all the information about
the given transaction system to schedule each transaction in the system. For example, let the database D =
{A, B} with the consistency constraint "A + B = 100". Suppose that {$T_1$, $T_2$} is a transaction system where
$T_1$ = {A := A - 1; B := B + 1} and $T_2$ = {B := B - 2; A := A + 2}. After examining the details of these two
transactions, one may determine that the appropriate atomic partition of $T_i$, i = 1 to 2, is to specify each step
as an atomic step segment. That is, steps of $T_1$ and $T_2$ can be interleaved arbitrarily. On the other hand, in
another related transaction system {$T_1$, $T_2'$} where $T_2'$ = {B := B - 2; A := 100 - B}, the correct specification
requires the entire transaction $T_1$ ($T_2'$) to be treated as a single atomic step segment, even though $T_2$ and $T_2'$
are equivalent when executing alone. That is, $T_1$ and $T_2'$ must be interleaved serializably. In a large transaction
system, a transaction system semantic information approach often requires users to first partition the
transaction system into different sub-transaction systems. Each transaction is then partitioned into different forms
each of which is suitable for a given sub-transaction system. For example, a nested form of multiple partitions
specified by "break points" was suggested in [Lynch 83a]. In contrast to the transaction system semantic
information approach, a modular approach requires that the atomic partition of each transaction be
constructed independent of the transaction system, so that the modification of any transaction will not invalidate
the atomic partition of another transaction.
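
The example with the constraint "A + B = 100" can be checked by brute force. The sketch below is an illustration only: it assumes that each assignment is a single indivisible step, that the initial state is A = 60, B = 40 (an arbitrary consistent choice), and that only the final state is tested against the constraint. All function and variable names are hypothetical.

```python
from itertools import combinations


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that keeps each transaction's own step order."""
    n, m = len(t1), len(t2)
    for slots in combinations(range(n + m), n):
        merged, i, j = [], 0, 0
        for pos in range(n + m):
            if pos in slots:
                merged.append(t1[i])
                i += 1
            else:
                merged.append(t2[j])
                j += 1
        yield merged


def run(schedule, state):
    """Apply the steps of a schedule to a copy of the state and return the result."""
    state = dict(state)
    for step in schedule:
        step(state)
    return state


def consistent(state):
    return state["A"] + state["B"] == 100


def dec_a(s): s["A"] -= 1            # A := A - 1
def inc_b(s): s["B"] += 1            # B := B + 1
def dec_b2(s): s["B"] -= 2           # B := B - 2
def inc_a2(s): s["A"] += 2           # A := A + 2
def set_a(s): s["A"] = 100 - s["B"]  # A := 100 - B


T1 = [dec_a, inc_b]
T2 = [dec_b2, inc_a2]
T2_PRIME = [dec_b2, set_a]
START = {"A": 60, "B": 40}  # assumed initial consistent state

ok_plain = [consistent(run(z, START)) for z in interleavings(T1, T2)]
ok_prime = [consistent(run(z, START)) for z in interleavings(T1, T2_PRIME)]
```

Under these assumptions every interleaving of the first system ends in a consistent state, while some non-serial interleavings of the second system do not, although both of its serial orders do.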

Before proceeding, we first define the notion of a scheduling rule. A scheduling rule is a function which
takes a transaction system and partitions the steps of each transaction into equivalence classes called atomic step
segments.

Definition: Let $T_m$ denote the set of all the possible consistent and correct transactions with m steps. Let
$\mathcal{T}$ denote the set of all the possible consistent and correct transactions, i.e.
$\mathcal{T} = \bigcup_{m=1}^{\infty} T_m$. Let $P_m$ denote a partition of an m-step consistent and correct
transaction into atomic step segments. Let $\mathcal{P}_m$ denote the set of all the possible partitions of an
m-step consistent and correct transaction. Let $\mathcal{P}$ be the set of all the possible partitions, i.e.
$\mathcal{P} = \bigcup_{m=1}^{\infty} \mathcal{P}_m$.

A scheduling rule for a transaction system with n transactions, $R_n$, is a function which takes the transaction
system of size n and partitions each of the n transactions,

$R_n : \prod_{i=1}^{n} \mathcal{T} \rightarrow \prod_{i=1}^{n} \mathcal{P}$

A scheduling rule R is a function which takes a transaction system of any size and partitions each of the
transactions in the system, such that the restriction of R to $\prod_{i=1}^{n} \mathcal{T}$ is $R_n$, i.e.

$R|_{\prod_{i=1}^{n} \mathcal{T}} = R_n$, n = 1 to $\infty$.

Given  a  scheduling  rule  R,  we  must  identify  the  set  of  schedules  that  satisfy  R.  A  schedule  z  satisfies  R  if 
each  of  the  atomic  step  segments  specified  by  R  is  executed  serializably  under  z.  This  is  formalized  as  follows. 

Definition: Let $T = \{T_1, \ldots, T_n\}$ be a consistent and correct transaction system, i.e.
$T \subset \mathcal{T}$. Let $T_i$ be a transaction in T. Let $\pi_R(T_i)$ denote the atomic partition of the steps
of $T_i$ by $R_n$, the restriction of R to $\prod_{i=1}^{n} \mathcal{T}$. Let $\tilde{T}$ be the set of all the atomic
step segments of T specified by R, i.e. $\tilde{T} = \bigcup_{T_i \in T} \pi_R(T_i)$. Let D be the database and Z(T)
be the set of all the possible schedules for T.


A schedule $z \in Z(T)$ is said to satisfy R if and only if,

$\mathrm{Cycle}(G(z, \tilde{T}, D)) = \emptyset$,

where $G(z, \tilde{T}, D)$ is the precedence graph² for schedule z with respect to the database D and the set of step
segments $\tilde{T}$. The set of all such schedules for T is denoted by $Z_R(T)$. That is,
$Z_R(T) = \{z \mid \mathrm{Cycle}(G(z, \tilde{T}, D)) = \emptyset\}$.
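
The acyclicity test can be sketched in a few lines. The precedence graph itself is defined elsewhere in the report, so the sketch below assumes the conventional conflict-based construction: an edge runs from segment a to segment b when a step of a precedes a step of b on the same data object and at least one of the two steps is a write. The names precedence_edges, has_cycle and satisfies are hypothetical.

```python
def precedence_edges(schedule):
    """schedule: list of (segment, op, obj) in execution order, op in {"r", "w"}.

    Returns the set of edges (a, b) such that a step of segment a precedes a
    conflicting step of a different segment b.
    """
    edges = set()
    for i, (seg_a, op_a, obj_a) in enumerate(schedule):
        for seg_b, op_b, obj_b in schedule[i + 1:]:
            if seg_a != seg_b and obj_a == obj_b and "w" in (op_a, op_b):
                edges.add((seg_a, seg_b))
    return edges


def has_cycle(edges):
    """Depth-first search for a directed cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


def satisfies(schedule):
    """A schedule satisfies the rule if its precedence graph has no cycle."""
    return not has_cycle(precedence_edges(schedule))


# Two segments s1 and s2, each reading and then writing object A.
serial_like = [("s1", "r", "A"), ("s1", "w", "A"), ("s2", "r", "A"), ("s2", "w", "A")]
interleaved = [("s1", "r", "A"), ("s2", "r", "A"), ("s2", "w", "A"), ("s1", "w", "A")]

# The same interleaving when a rule makes every step its own atomic step segment.
per_step = [("s1a", "r", "A"), ("s2a", "r", "A"), ("s2b", "w", "A"), ("s1b", "w", "A")]
```

With two-step segments the interleaved schedule produces a cycle between s1 and s2 and is rejected. When each step is its own segment, all edges point forward in time, so the same interleaving is accepted.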


Definition: A scheduling rule R is said to be consistent and correct, if and only if all the schedules that
satisfy R are consistent and correct,

$\forall (T \subset \mathcal{T})\, \forall (z \in Z_R(T))$ (z is consistent and correct)

In  the  following,  we  limit  our  discussions  to  consistent  and  correct  scheduling  rules.  We  consider  a 
consistent  and  correct  scheduling  rule  to  be  modular,  if  it  schedules  each  transaction  independent  of  other 
transactions  in  the  system. 

Definition: A consistent and correct scheduling rule R is said to be modular if and only if R schedules each
transaction independently, i.e.

$R_n(\{T_1, \ldots, T_n\}) = (R_1(\{T_1\}), \ldots, R_1(\{T_n\}))$, n = 1 to $\infty$,

where $R_n$ is the restriction of R to $\prod_{i=1}^{n} \mathcal{T}$. The scheduling rule for individual
transactions, $R_1$, will be referred to as the kernel of the modular scheduling rule R.

We  now  turn  to  the  concept  of  an  application  independent  scheduling  rule.  Thus  far  we  have  assumed  that 
each  programmer  writes  and  schedules  his  own  transaction.  The  scheduling  is  done  with  full  knowledge  of  the 
transaction  written  but  without  specific  knowledge  of  others’  transactions.  To  further  simplify  the  scheduling 
task,  we  would  like  to  develop  scheduling  rules  that  can  be  mechanically  applied  to  all  the  transactions, 
independent  of  their  semantics.  To  this  end,  an  application  independent  scheduling  rule  must  ignore  the 
specifics  of  various  transactions  and  use  only  the  syntax  information  of  transactions,  i.e.  names  of  the  data 
objects  read  or  written  by  each  transaction.  In  other  words,  an  application  independent  scheduling  rule  views 
a  transaction  as  a  sequence  of  read  and  write  steps  without  knowing  the  computation  carried  out  by  the 
transaction. 


² Defined in Section 3.13.Lb

Definition: Define an equivalence relation on the set of all the consistent and correct transactions
$\mathcal{T}$. Two consistent and correct transactions $T_i$ and $T_j$ are said to be equivalent in syntax, denoted
by $T_i \equiv T_j$, if and only if,

1. $T_i$ and $T_j$ have the same number of steps.

2. If step k of $T_i$ reads (writes) data object O, then step k of $T_j$ reads (writes) the same data object O,
for all k.
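
The two conditions translate directly into a comparison that ignores the computation carried out by each step. The following is a minimal sketch, assuming a transaction is represented as a list of (operation, data object, description) triples. This representation and the sample transactions are hypothetical.

```python
def equivalent_in_syntax(t_i, t_j):
    """Return True if t_i and t_j are equivalent in syntax.

    Each transaction is a list of (op, obj, description) triples, where op is
    "r" or "w". The description of the computation is deliberately ignored.
    """
    if len(t_i) != len(t_j):                  # condition 1: same number of steps
        return False
    return all(a[:2] == b[:2] for a, b in zip(t_i, t_j))  # condition 2


decrement = [("r", "A", "a = A"), ("w", "A", "A = a - 1")]
double = [("r", "A", "a = A"), ("w", "A", "A = a * 2")]
copy_to_b = [("r", "A", "a = A"), ("w", "B", "B = a")]
```

Here decrement and double are equivalent in syntax although they compute different results, while copy_to_b is not equivalent to either because its second step writes a different data object.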

Definition: A modular scheduling rule R is said to be application independent if the kernel of R identically
partitions transactions with equivalent syntaxes,

$\forall ((T_i, T_j \in \mathcal{T}) \land (T_i \equiv T_j))(R_1(T_i) = R_1(T_j))$

Theorem 9: Setwise and generalized setwise serializable schedules are modular.

Proof: First, setwise serializable schedules are a special case of generalized setwise serializable schedules.
Second, generalized setwise serializable schedules are consistent and correct by Theorem 8. Third, a
generalized setwise serializable scheduling rule takes each transaction separately and partitions it into elementary
transactions and then partitions the elementary transactions into transaction ADS segments. This is done
independently for each transaction. It follows from the definition of modular scheduling rules that both scheduling
rules in question are modular. □

Theorem 10: The setwise serializable scheduling rule is application independent.

Proof: A setwise serializable scheduling rule partitions the steps of the given transaction into transaction
ADS segments. A transaction ADS segment consists of all the steps which read or write data objects from the
same ADS. It follows that two transactions which are equivalent in syntax have the same partition. It follows
from the definition that a setwise serializable scheduling rule is application independent. □

It should be pointed out that a generalized setwise serializable scheduling rule is modular but not
application independent. This is because the decomposition of a transaction into a collection of elementary
transactions requires an understanding of the details of the transaction in question. We cannot correctly perform
the decomposition using syntax information alone.


3.3.4.2 Optimality and Completeness

The  set  of  primitive  steps  used  in  a  given  syntax  affects  the  degree  of  concurrency  provided  by  a  scheduling 
rule. The primitive steps defined in our transaction syntax are the conventional two: "read" and "write". It
has  been  shown  that  for  serializable  schedules  the  concurrency  can  be  improved  if  the  set  of  primitive  steps  is 
expanded  to  include  other  commutative  ones  [Korth  83].  The  idea  is  that  if  a  set  of  steps  is  commutative,  then 
there  is  no  need  to  control  their  relative  order.  For  example,  read  steps  are  commutative  with  each  other,  as 
are  unconditional  add  steps.  For  a  full  treatment  of  this  subject,  readers  are  referred  to  [Korth  83].  In  the 
following,  we  limit  our  discussion  to  transactions  using  only  primitive  steps:  "read"  and  "write".  The  use  of 
commutative  steps  to  improve  the  concurrency  of  (generalized)  setwise  serializable  schedules  can  be  done  in  a 
manner  similar  to  that  done  by  Korth  for  serializable  schedules. 

We  begin  our  investigation  by  first  defining  a  way  to  compare  the  degree  of  concurrency  offered  by 
different  scheduling  rules. 

Definition: Scheduling rule $R^1$ is said to be at least as concurrent as $R^2$, denoted by $R^1 \ge R^2$, if and
only if,

$\forall (T \subset \mathcal{T})(Z_{R^2}(T) \subseteq Z_{R^1}(T))$

That is, the concurrency of schedules is partially ordered by set containment. The relative concurrency of
two scheduling rules can be incomparable. We now define the notion of optimal application independent
scheduling rules.

Definition: Let $A_a$ be the set of all the application independent scheduling rules. An application
independent scheduling rule $R^* \in A_a$ is said to be optimal if $R^*$ is at least as concurrent as any rule in
$A_a$. That is,

$\forall (R \in A_a)(R^* \ge R)$

We now prove that the setwise serializable scheduling rule is optimal in the set of application independent
scheduling rules. The key to the proof is to show that if a modular scheduling rule R partitions a transaction
ADS segment $\sigma$ into two or more atomic step segments, then there exists some schedule z satisfying R such
that the inputs to $\sigma$ under z cannot come from a single consistent state.

Lemma 11-1: Let $A = \{O_1, \ldots, O_m\}$ be an ADS from the maximal consistency preserving partition of D.
Let the index set of A, I, be partitioned into two sets $S_1$ and $S_2$. Let U be the universal set of the consistent
states of A. Then

$\exists (W \in U \land X \in U \land Y \notin U)((\pi_{S_1}(W) = \pi_{S_1}(Y)) \land (\pi_{S_2}(X) = \pi_{S_2}(Y)))$

Proof: Suppose that the claim is false. By the definition of a consistency preserving partition, {$S_1$, $S_2$} is a
consistency preserving partition of A. This contradicts the assumption that D is maximally partitioned. □


Lemma 11-2: Let database D be maximally partitioned into atomic data sets. Let $T_i$ be a consistent and
correct transaction operating upon D. Let $\sigma$ be a transaction ADS segment in $T_i$. Let the ADS accessed by
$\sigma$ be A. Suppose that $\sigma$ is partitioned into atomic step segments $\sigma_1, \ldots, \sigma_k$, k > 1, by some
modular scheduling rule R. Then there exist a transaction system $T \subset \mathcal{T}$ and a schedule
$z \in Z_R(T)$ such that under z the values input to $\sigma$ are not projections from any single consistent state of A.

Proof: Let $\pi_{ADS}(T_i) = \{\sigma_1, \ldots, \sigma_k\}$ be the ADS atomic partition of $T_i$. That is, each element
of the partition is a transaction ADS segment. Suppose that a modular scheduling rule R partitions transaction ADS
segment $\sigma_1$ into $\sigma_{1,1}, \ldots, \sigma_{1,j}$, j > 1. In the following, we prove that if j = 2 then the
inputs to $\sigma_1$ cannot come from a single consistent state of A. If j > 2, we merge
$\sigma_{1,2}, \ldots, \sigma_{1,j}$ into a single atomic step segment $\sigma_{1,2}$ and then use the result for j = 2.
There can be two cases.

Case 1: $\sigma_{1,1}$ and $\sigma_{1,2}$ access disjoint sets of data objects in ADS A. Let I be the index set of A.
Partition I into $S_1$ and $S_2$ such that $\sigma_{1,i}$ accesses only data objects indexed by $S_i$, i = 1 to 2.

Let $T_w$ and $T_x$ be two transactions in $\mathcal{T}$, assigning consistent states W and X to A respectively. Let
$W_1 = \pi_{S_1}(W)$ and $X_2 = \pi_{S_2}(X)$. By Lemma 11-1, $W_1$ and $X_2$ could be the projections of an
inconsistent state Y. We assume this is the case. The schedule
$z = \{T_w, \sigma_{1,1}, T_x, \sigma_{1,2}, \sigma_2, \ldots, \sigma_k\}$ satisfies R because each of the atomic
step segments specified by R is executed serializably under z. Note that the inputs to $\sigma_1$ are projections from
the inconsistent state Y.

Case 2: $\sigma_{1,1}$ and $\sigma_{1,2}$ share at least one data object. Let a shared data object be O. As discussed
in case 1, the schedule $z = \{T_x, \sigma_{1,1}, T_y, \sigma_{1,2}, \ldots, \sigma_k\}$ satisfies R. Let the values of O
assigned by $T_x$ and $T_y$ be x and y respectively. Hence, under z there are two steps accessing O, inputting values
from two different states of A.

It follows that R cannot guarantee all the inputs to $\sigma_1$ are projections from a single consistent state of A. □

Theorem  11:  The  setwise  serializable  scheduling  rule  is  optimal  in  the  set  of  application  independent 
scheduling  rules. 

Proof: Let the setwise serializable scheduling rule be $R^*$. Let R be any other application independent
scheduling rule. Let $T_i$ be any consistent and correct transaction. Let the database be maximally partitioned
into atomic data sets. Let $\sigma$ be a transaction ADS segment of $T_i$ accessing ADS A. We examine the atomic
partition of $T_i$ by R. If any ADS segment $\sigma$ of $T_i$ is partitioned, then by Lemma 11-2 the inputs to
$\sigma$ cannot be guaranteed to be the projections of a single consistent state of A. Since R identically partitions
all the transactions equivalent to $T_i$ in syntax, we can interpret the semantics of $T_i$ as follows. $T_i$ produces
correct results if and only if the input values to $\sigma$ are consistent, i.e. projections from a single consistent
state of A. Hence, if any transaction ADS segment is partitioned by R, R will be incorrect. It follows that the atomic
step segments of $T_i$ specified by R can only be either transaction ADS segments or the supersets of transaction
ADS segments. This result applies to any transaction $T_i$ in any transaction system $T \subset \mathcal{T}$. It
follows that any schedule z satisfying R satisfies $R^*$. Therefore, $R^*$ is at least as concurrent as R. □

We  now  investigate  the  completeness  issue  of  generalized  setwise  serializable  scheduling  rules.  Associated 
with  each  transaction,  there  is  the  specification  of  input  steps,  output  steps  and  the  relationship  among  input 
values  and  output  values  in  the  form  of  post-conditions.  The  transaction  must  be  written  in  such  a  way  that 
the  consistency  of  the  database  is  preserved  and  the  post-condition  is  satisfied  when  executing  alone. 
However,  in  many  cases  the  isolated  execution  environment  is  only  a  sufficient  condition.  We  have  shown 
that  if  we  are  able  to  partition  the  steps  of  a  transaction  into  elementary  transactions  and  then  partition  the 
steps  of  each  elementary  transaction  into  transaction  ADS  segments,  the  consistency  of  the  database  is 
preserved  and  the  specified  post-condition  is  satisfied. 

We have developed the syntactic structure of compound transactions to help users form such modular
but application dependent atomic partitions of transactions. The question remaining to be answered is
whether there exists another form of partitioning a given transaction that would be modular and lead to a
higher degree of concurrency. To answer this question, we introduce the notion of completeness. We say that
the generalized setwise serializable scheduling rules form a complete class within the set of modular
scheduling rules. This means that given any modular scheduling rule R we can always find a generalized setwise
serializable scheduling rule $R^*$ such that $R^*$ is at least as concurrent as R. Hence, a programmer who is
interested in modular scheduling rules providing a high degree of system concurrency needs to look no
further than the class of generalized setwise serializable scheduling rules. All he has to do is to maximally
partition his transaction into elementary transactions. Once this is done, each of the elementary transactions
can be mechanically partitioned into transaction ADS segments.

Definition: Let $A_M$ be the set of all the modular scheduling rules. A set of scheduling rules $\mathcal{C}$ is
said to form a complete class within $A_M$, if and only if

$\forall (R \in A_M)[\exists (R^* \in \mathcal{C})(R^* \ge R)]$

Before proceeding with the proof of completeness, we need to introduce the notion of the post-condition
associated with an atomic step segment. For example, let the transaction
$T_i$ = {$t_{i,1}$: A := A - 1; $t_{i,2}$: A := A + 1}. If the two steps of $T_i$ are treated as a single atomic step
segment, then $t_{i,1}$ is an input step and $t_{i,2}$ is an output step. The partition of a transaction could create
input and output steps in addition to those defined in an executing alone environment. For example, if each of
these two steps is an atomic step segment, then $t_{i,1}$ ($t_{i,2}$) is both an input step and an output step.

Definition: Let $\sigma = \{t_{i,1}, \ldots, t_{i,n}\}$ be an atomic step segment. Let the data object accessed by
step $t_{i,j} \in \sigma$ be O. Step $t_{i,j}$ is an input step if it is the step in $\sigma$ first accessing O. Step
$t_{i,j}$ is an output step if it is the step in $\sigma$ last accessing O.
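
The first-access and last-access rule is easy to state operationally. The sketch below is illustrative only, with hypothetical names: an atomic step segment is assumed to be an ordered list of (step name, data object) pairs.

```python
def input_output_steps(segment):
    """Return (input_steps, output_steps) for an atomic step segment.

    segment: ordered list of (step_name, data_object) pairs.
    A step is an input step if it is the first in the segment to access its
    data object, and an output step if it is the last to access it.
    """
    first, last = {}, {}
    for name, obj in segment:
        first.setdefault(obj, name)  # keeps only the first accessor of obj
        last[obj] = name             # overwritten until the last accessor
    return set(first.values()), set(last.values())


# Both steps of the example transaction access data object A.
whole = [("t1", "A"), ("t2", "A")]
inputs_whole, outputs_whole = input_output_steps(whole)

# Each step treated as its own atomic step segment.
inputs_t1, outputs_t1 = input_output_steps([("t1", "A")])
```

For the two-step segment the first step is the only input step and the second is the only output step. When each step forms its own segment, that step is both an input step and an output step, as in the example above.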

Definition: Let $O_j[v_i]$, j = 1 to k, be the values input to the input steps of $\sigma$ and $O_j[v_o]$, j = 1 to k,
be the values output by the output steps of $\sigma$. The post-condition of $\sigma$ is a specification of the output
values as functions of input values when $\sigma$ is executing alone,

$O_j[v_o] = f_j(O_1[v_i], \ldots, O_k[v_i])$, j = 1 to k.

We  now  prove  that  generalized  setwise  serializable  scheduling  rules  form  a  complete  class. 

Lemma 12: A modular scheduling rule R is consistent and correct if and only if for each of the transactions
$T_i$ in $\mathcal{T}$,

1. Each atomic step segment in $T_i$ specified by R preserves the consistency of the database when
executing alone.

2. The conjunction of the post-conditions of all the atomic step segments in $T_i$ specified by R is
equivalent to the post-condition associated with $T_i$.

Proof: First, if any atomic step segment $\sigma$ in $T_i$ specified by R does not preserve the consistency of the
database when executing alone, then another transaction $T_j$ executing after $\sigma$ would input an inconsistent
state. Since R is modular, we can define the semantics of $T_j$ as one that outputs incorrect results when its input
is inconsistent. Thus R is incorrect. Second, if the conjunction of the post-conditions of all the atomic step
segments in $T_i$ specified by R is not equivalent to the post-condition associated with $T_i$, then R is incorrect by
definition. Conversely, since any schedule z for any transaction system $T \subset \mathcal{T}$ satisfying R
guarantees that each of the atomic step segments will be executed serializably, it follows from conditions 1 and 2
that z is consistent and correct. □

Theorem  12:  Generalized  setwise  serializable  scheduling  rules  form  a  complete  class  within  the  set  of 
modular  scheduling  rules. 

Proof: Let R be any modular scheduling rule. Suppose that $T_i$ is a consistent and correct transaction and $T_i$
is partitioned into atomic step segments $\sigma_1, \ldots, \sigma_k$ by R. First, by Lemma 12, $\sigma_j$, j = 1 to k,
must preserve the consistency of the database when executing alone. Second, by Lemma 12 the conjunction of the
post-conditions of $\sigma_1, \ldots, \sigma_k$ must be equivalent to the post-condition associated with $T_i$. Note
that each $\sigma_j$, j = 1 to k, satisfies the definition of an elementary transaction. Define a generalized setwise
serializable scheduling rule $R^*$ which partitions $T_i$ as follows. First, $R^*$ labels $\sigma_1, \ldots, \sigma_k$
as elementary transactions. Next, $R^*$ partitions these elementary transactions into transaction ADS segments.
Hence, $R^*$ is at least as concurrent as R. □

3.3.5  Conclusion 

The very nature of a distributed system provides us with the opportunity to realize a very high degree of
concurrency. The desire to realize a higher degree of concurrency than that permitted by serializable
schedules has motivated computer scientists to develop non-serializable concurrency control methods.
However, a distributed computer system is typically very complex, written and maintained by many
programmers over a period of years. Therefore, it is important to develop a modular approach to non-serializable
concurrency control. In this approach, programmers are permitted to write, modify and schedule their
transactions independent of each other. We have defined a new type of transaction syntax called compound
transactions and its associated schedules called generalized setwise serializable schedules. The classical single
level transaction and nested transactions are special cases of compound transactions. Serializable schedules
are special cases of generalized setwise serializable schedules. We have shown that generalized setwise
serializable schedules are consistent, correct and modular.

In  addition,  generalized  setwise  serializable  schedules  form  a  complete  class  within  the  set  of  modular 
schedules.  This  means  that  for  any  given  modular  scheduling  rule  R,  there  exists  a  generalized  setwise 
serializable  scheduling  rule  which  is  at  least  as  concurrent  as  R.  Hence,  users  who  are  interested  in  providing 
a  high  degree  of  system  concurrency  need  look  no  further  than  generalized  setwise  serializable  schedules.  An 
important  special  case  of  generalized  setwise  serializable  scheduling  rules  is  the  setwise  serializable  scheduling 
rule.  We  have  shown  that  the  setwise  serializable  scheduling  rule  is  optimal  in  the  set  of  all  application 
independent scheduling rules. This rule can be "mechanically" applied to schedule any transaction without
knowing its semantics. These optimality results are proven under the assumption that the only primitive steps
used in the transaction syntax are "read" and "write". The concurrency of (generalized) setwise serializable
schedules can be improved by developing families of commutative steps appropriate to one's application.
This  can  be  done  in  a  way  similar  to  that  done  by  Korth  [Korth  83]  for  serializable  schedules. 

Finally, an important issue mentioned but not addressed in this paper is the principle of designing the
consistency constraints for the database embedded in a distributed operating system. Generally speaking,
using a set of consistency constraints that is weaker than the corresponding ones in the centralized operating
system permits a higher degree of concurrency. However, once consistency constraints are weakened, the
complexity of transactions will be increased. We believe that the study of the principles of designing the
consistency constraints for a distributed operating system in general and the evaluation of the trade-offs
between system concurrency and transaction complexity in particular is an exciting new area of research.

3.4  Distributed  Cooperating  Processes  and  Transactions 

3.4.1  Co-operating  Processes 

3.4.1.1 A New Formulation

The  synchronization  of  co-operating  processes  is  an  important  aspect  of  an  operating  system.  When  the 
processes  are  physically  dispersed,  classical  centralized  techniques  are  usually  not  cost-effective.  Our  model 
of  data  consistency  (unlike  the  serialization  model)  is  able  to  handle  this  because  the  relationships  among 
distributed  co-operating  processes  are  represented  as  partially  dependent  relations  among  the  state  variables 
of co-operating processes. The synchronization of co-operating processes is thus defined as the maintenance
of these dependency relations.

According to this model, co-operating processes generally have two phases: an autonomous phase and a
dependent phase. In the autonomous phase, the state variables of the co-operating processes take on values
that belong to the set of the cartesian products of the subsets of the domains of these state variables. For
example, let the domains of the state variables of processes $P_1$ and $P_2$ both be {0, 1, 2, 3}, and let the relation
between them be { {0, 1} x {0, 1}, <2, 2>, <3, 3> }. That is, processes $P_1$ and $P_2$ can change their states
autonomously, as long as their state variables take on values from the set of the cartesian products
{ {0, 1} x {0, 1} }.

In the dependent phase, all state variables in a process must take on values according to the data invariants
-- e.g., the state variables of P1 and P2 above must both have values of either 2 or 3. The problem of ensuring
that a set of processes, e.g., P1 and P2, will enter their dependent (e.g., identical) states is a matter of
maintaining the data invariants "P1 = P2, 2 <= P1, P2 <= 3". This can be done by requiring that the
manipulation of the state variables of processes P1 and P2 satisfy the conformity condition.
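
The two phases of the example relation can be checked mechanically. The following is a minimal sketch, not part of the Archons design; the names AUTONOMOUS, DEPENDENT and phase are hypothetical and chosen only for illustration.

```python
from itertools import product

# Example relation from the text: { {0, 1} x {0, 1}, <2, 2>, <3, 3> }.
AUTONOMOUS = set(product({0, 1}, {0, 1}))   # any pairing of 0 and 1 is allowed
DEPENDENT = {(2, 2), (3, 3)}                # P1 = P2 and 2 <= P1, P2 <= 3
RELATION = AUTONOMOUS | DEPENDENT


def phase(p1, p2):
    """Classify a joint state; return None if the pair violates the relation."""
    if (p1, p2) in AUTONOMOUS:
        return "autonomous"
    if (p1, p2) in DEPENDENT:
        return "dependent"
    return None
```

A joint state such as <2, 3> is rejected, which is exactly the situation that the conformity condition is meant to rule out.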


In the autonomous phase, there are no data invariants among the state variables of processes P1 and P2 to
be maintained; thus it is possible to allow these two processes to maintain a probabilistic relationship among
their state variables. This can be accomplished by assigning a joint probability distribution over the set of
cartesian products of the processes' state variables. From this joint distribution, we can derive conditional
distributions to interpret the probabilistic relationships among the states of processes co-operating in the
autonomous phase. In practice, one often designs a probabilistic algorithm, observes the induced probability
distribution, and iterates on the design until the resulting distribution is satisfactory. For example, we can
have the following conditional distributions regarding processes P1 and P2.

P[p2 = 0 | p1 = 0] = 0.8,   P[p2 = 0 | p1 = 1] = 0.2,

P[p2 = 1 | p1 = 0] = 0.2,   P[p2 = 1 | p1 = 1] = 0.8

This can be interpreted as P1 requesting P2 to be in the same state as P1, and although P2 is not obligated to
honor P1's request, P2 does give P1's request favorable consideration. Therefore, when P1 is in state 0 (or 1),
P2 is likely to be in state 0 (or 1).
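
The step from a joint distribution to conditional distributions can be illustrated as follows. This is a sketch under an added assumption: the text gives only the conditional values, so the joint distribution JOINT below assumes that P1 is equally likely to be in state 0 or 1. The names JOINT and conditional are hypothetical.

```python
from fractions import Fraction

# Assumed joint distribution over (p1, p2); the uniform marginal for P1 is an
# assumption made for illustration and is not stated in the text.
JOINT = {
    (0, 0): Fraction(4, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(4, 10),
}


def conditional(p2, given_p1):
    """Return P[p2 | p1] = P[p1, p2] / P[p1], computed from the joint distribution."""
    marginal = sum(prob for (a, _), prob in JOINT.items() if a == given_p1)
    return JOINT[(given_p1, p2)] / marginal
```

With this joint distribution, conditional(0, 0) and conditional(1, 1) both evaluate to 4/5, matching the 0.8 entries above.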

The need for probabilistic co-operation often arises due to the communication delays in physically
dispersed systems. It may be less expensive to maintain certain relationships among data objects
indeterministically and recover when necessary, than to force those relationships to always be deterministic.

We now turn to the subject of phase transitions. The transition from the autonomous phase to the
dependent phase requires the establishment of a dependency relationship among state variables. Since dependency
relationships are defined on version numbers, their establishment includes equalizing the version numbers of
each state variable (for instance, by resetting them to zero), and assigning appropriate values to the state
variables. In general, a phase transition is carried out in three stages. First, if there is more than one process
requesting that the transition be made, one of the requesting processes is selected. Next, all of the
co-operating processes must be instructed to complete (or abort) any current outstanding autonomous
manipulation of state variables, and not to initiate further autonomous manipulation. Finally, values must be
assigned to each of the state variables according to the selected process's requirements, and the version
numbers of the state variables must be reset.
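
The three stages can be sketched as below. This is an illustrative outline only: the StateVariable class, the autonomous_enabled flag, and the policy of selecting the first request are assumptions, and a real implementation would carry out each stage with messages and a conformal transaction rather than local assignments.

```python
from dataclasses import dataclass


@dataclass
class StateVariable:
    value: int
    version: int = 0
    autonomous_enabled: bool = True   # hypothetical flag: may the owner change it freely?


def enter_dependent_phase(requests, variables):
    """requests: list of (requester, target_value); variables: dict name -> StateVariable."""
    # Stage 1: select one of the requesting processes (assumed policy: first come).
    requester, target = requests[0]
    # Stage 2: stop all autonomous manipulation of the state variables.
    for var in variables.values():
        var.autonomous_enabled = False
    # Stage 3: assign the selected requester's value and equalize the version numbers.
    for var in variables.values():
        var.value = target
        var.version = 0
    return requester
```

For example, if P1 asks for state 2 and P2 asks for state 3, calling enter_dependent_phase([("P1", 2), ("P2", 3)], variables) leaves every state variable at value 2 and version 0.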

The  transition  of  processes  from  the  dependent  phase  to  the  autonomous  phase  is  a  simple  matter.  Once  a 
process  obtains  the  right  to  manipulate  the  current  version  of  the  atomic  data  set,  it  can  bring  the  co-operating 
processes  to  an  autonomous  phase  by  assigning  appropriate  values  from  the  set  of  cartesian  products  to  the 
state  variables. 

Although there are many different algorithms to implement process phase transition and synchronization
activities, we have found that (in a variety of applications) the use of a synchronization path is an effective
technique. In the example above, processes P1 and P2 co-operate probabilistically in states 0 and 1. Suppose
now that P1 wants P2 to jointly enter state 2, while P2 wants P1 to jointly enter state 3. To resolve such a
conflict, a synchronization path could be defined as follows. Any request for dependent co-operation must
first be submitted to P1. If more than one request is received at P1, one will be honored and forwarded to P2
where it will also be honored. Requests that were not selected by P1 will be queued to be selected at later
times. The following example illustrates the use of synchronization paths.

3.4.1.2 Example: Remote Process Interruption and Abortion

This example arose in the context of the Spice graphics package, Canvas [Ball 82], which consists of two
co-operating processes running on the Accent network operating system. One process is a remote server while
the other is a user interface process. The user interface is local to the user's machine and relays user
commands to the remote server via messages. For our discussion, we abstract the user interface into four
basic commands: execute, interrupt, continue and abort.

The two basic requirements for this task are: first, it is desirable to minimize message traffic between the
two processes; and second, the results of remote service cannot be made permanent until the user is informed
that the job is done -- that is, the user is given a chance to abort or interrupt the remote process up until the
point where he is notified that the job is done. From an implementation point of view, this requirement
implies that the user's request should take precedence when there is a conflict between a remote server that is
trying to make a result permanent, and a user who is trying to abort (or interrupt) an outstanding server
process.

Initially, a remote procedure call based solution was considered because, intuitively, tasks with a remote
server seemed to fit this paradigm well. However, it was soon discovered that the conflict between the server
process and the user made the remote procedure call approach difficult to use. This is because in a remote
procedure call environment, control is passed from the requesting process when the server process is called,
and is returned when the server has completed processing the request (or the system detects that the server
has failed). The concept of asynchronously interrupting an executing server process is counter to the remote
procedure call paradigm. Thus, the problem defined above cannot be easily solved with a classical remote
procedure call approach. In this example, the initial attempt to use remote procedure calls resulted in an
overly complex implementation. Furthermore, a remote procedure call approach also generates more message
traffic, as all inquiries must be forwarded to the remote server for a response, due to the fact that the state of
the remote server changes asynchronously with respect to the state of the user server process.

In general, the remote procedure call paradigm is appropriate for tasks with master/slave (i.e., hierarchical)
control structures, but it becomes much less so for peer processes having symmetrical control relationships.
An approach based on our model does not impose such a restrictive control structure on the co-operating
processes, and permits the use of local information to reduce the communication overhead. Let the state
variable of the interface process be Su and the state variable of the remote server be Sg. If we maintain data
invariants in the form of "Su = Sg", the user interface process can provide the user rapid response by looking
only at its local state variable Su. To ensure that the user-issued abort and interrupt commands win any
conflicts, we define a synchronization path such that any command must first update the state variable of
the user interface process.


The basic states of the remote server and the user interface processes are called IDLE, SUSPENDED, and
EXECUTION, and are labeled as state zero, one and two respectively. The state diagram in Figure 3-1 indicates
the defined state transitions, and other command occurrences not defined there will have no effect.


[Figure 3-1: State transition diagram of the remote server and the user interface processes. Labels
surviving from the diagram: "Abort"; "(Make result permanent and inform user interface process)".]

When the system is initialized, Su[0] = Sg[0] = 0. Then, when a user issues an EXECUTE command, Su will
be updated first and Su[1] = 2. The EXECUTE command updates Su[0], and also updates Sg[0] (via messages),
resulting in Sg[1] = 2. That is, both the virtual and remote servers go to the Execute state. Suppose that
suddenly a user discovers that something is wrong and he issues an ABORT command, while at the same time
the server issues a COMPUTED signal (indicating that the computation is done and the result is ready to be
made permanent). At this point, there is a conflict between the COMPUTED signal and the ABORT command.
Since Su must be updated first and is local to the user, the ABORT command is applied to Su[1] first, making
Su[2] = 0. When the COMPUTED signal reaches the user interface it will find that Su is in the Idle state, and
will have no effect. On the other hand, the ABORT command, after updating Su[1], will update Sg[1] and cause
Sg[2] = 0. Therefore, the ABORT command wins the conflict, resulting in the system returning to the Idle
state. Suppose now that the user accidentally issues an INTERRUPT command. The interface process would
check its state variable and find that Su[2] = 0. Thus, the INTERRUPT command would be considered invalid
and the interface process would warn the user based on its local information alone, and the server would not
be affected. Thus, traffic is minimized; there will be no messages between the two processes unless they bring
about the state transitions.
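
The scenario above can be traced with a small sketch. The transition table is an assumed reading of Figure 3-1 (in particular, the INTERRUPT, CONTINUE and COMPUTED rows are our guess at the diagram), and the class name Canvas and the counter propagated are hypothetical. Version numbers are omitted for brevity. Every event is applied to Su first, which is what the synchronization path requires.

```python
IDLE, SUSPENDED, EXECUTION = 0, 1, 2

# Assumed transitions; any (state, event) pair not listed here has no effect.
TRANSITIONS = {
    (IDLE, "EXECUTE"): EXECUTION,
    (EXECUTION, "INTERRUPT"): SUSPENDED,
    (SUSPENDED, "CONTINUE"): EXECUTION,
    (EXECUTION, "ABORT"): IDLE,
    (SUSPENDED, "ABORT"): IDLE,
    (EXECUTION, "COMPUTED"): IDLE,   # make result permanent, inform user interface
}


class Canvas:
    def __init__(self):
        self.su = IDLE          # state variable of the user interface process
        self.sg = IDLE          # state variable of the remote server
        self.propagated = 0     # number of events that caused a state transition
        self.committed = False  # has a result been made permanent?

    def apply(self, event):
        """Apply an event to Su first; propagate to Sg only if Su accepts it."""
        key = (self.su, event)
        if key not in TRANSITIONS:
            return False        # rejected using local information alone
        self.su = TRANSITIONS[key]
        if event == "COMPUTED":
            self.committed = True
        self.sg = self.su       # restore the invariant Su = Sg
        self.propagated += 1
        return True
```

Applying EXECUTE, ABORT, COMPUTED and INTERRUPT in that order leaves both state variables at IDLE with nothing committed; only the first two events are propagated, and the last two are rejected at the user interface.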


3.4.1.3 Example: Process Creation and Destruction

This example arose from the Spice remote file server [Schaffer 82] running on the Accent network operating
system, with Unix as the local host operating system. The basic structure of the remote file server consists of a
parent process and a set of child processes created to handle users' file manipulation messages. A child
process maintains a data port for each of the opened files. The maximum number of such ports that can be
supported by a child process is twenty, due to the limitation of Unix on the maximum number of open files a
process may have. When a user first sends a request to open a file, a child process will be created for him.
When the user wants to open more than twenty files, an additional child process will be created for him. A
child process should be destroyed when it has closed all its ports.

Since the creation and destruction of a child process is a function of the number of ports, the parent process
must keep a record of the number of ports that each child process currently has. Thus, let C1.n be a local
variable which counts the number of ports at child C1, and let P1.n be the parent's local variable which
indicates the number of ports in C1. The standard solution is to construct an atomic data set consisting of
{C1.n, P1.n}, with the data invariant "C1.n = P1.n". This data invariant can be maintained by requiring all
conflicting transactions to be mutually exclusive with respect to the version numbers of the data objects.

However, there is a problem with this standard solution: it keeps a parent's record consistent with the
actual number of ports at the child process for all the values. The OPEN FILE and CLOSE FILE command pair
associated with each accessed file causes the number of ports at the child process to be incremented and
decremented. This results in two sets of conformal operations to update a parent's record. There is one parent
process for many children, and the creation and destruction of ports occurs frequently, so the number of
conformal operations needed tends to be large. Thus, the parent process becomes a performance bottleneck.

This raises the question of whether P1.n has to equal C1.n at all times and for all values. In fact, most of the
message traffic is generated to maintain a non-critical relation that could be more efficiently maintained
probabilistically. Note that there are only two important values of the port-count, zero and twenty. A port
count of zero requires the destruction of the child process, while a count of twenty requires the creation of a
new child when a user wants to open more files. Furthermore, we only need P1.n equal to C1.n with some
probability when the port-count is twenty. If the parent underestimates the number of ports, additional open
file requests will be sent to the child process. However, the child process can return the requests to the parent
saying that it already has twenty ports. If the parent overestimates the number of ports, a new child might be
unnecessarily created. The time and resources required for that are acceptable in this application. In
particular, the probability of creating unnecessary child processes is small, because most users need fewer than
twenty ports.

A port-count of zero, however, is critical because serious abnormalities could occur as a result of the
premature destruction of a child process. For example, a child process with ports could be destroyed. Since a
child cannot predict the arrival time of a new OPEN FILE command from a user, the child could create a new
port after sending a message to the parent process indicating that it has closed all the ports. If a child cannot
inform its parent of its status change in time, it could be destroyed by the parent who thinks that the child has
no more ports. Note that this problem cannot be solved by letting the parent wait a bit longer after it is
informed that the child has no more ports. This is because the arrival time of a new OPEN FILE command
from a user is unpredictable. In fact, until the user logs out, the system cannot predict when a user will issue a
new OPEN FILE command.

Since a port-count of zero is the only critical value, we can formulate a partial dependency relation as
follows. A child and its parent process are in an autonomous phase as the port count varies from one
to twenty, and they are in a dependent phase when the port count is zero. In addition, when a child has
twenty ports, we want its parent process to have a port count of twenty with relatively high probability. This
is summarized as:

P1.n = C1.n -- with higher probability,
               when the child process enters or
               leaves the state of twenty ports.

P1.n = C1.n -- deterministically,
               when the child process enters or
               leaves the state of zero ports.

This could be implemented by having the child process send a port-count message to its parent process
when it enters or leaves the state of twenty ports. No effort is made to guarantee that P1.n is equal to C1.n
with respect to all concurrent accesses. When the child process enters or leaves the state of zero ports, it
initiates a conformal transaction that brings about a phase transition and guarantees P1.n equal to C1.n with
respect to all concurrent accesses. When a child has ports between two and nineteen, it will not automatically
send any message to its parent because these values are not relevant to the creation or destruction of the child
process. However, when a child is interrogated by its parent, it will report its current number of ports via a
simple message. This is to permit the operating system to sample the number of opened files for reasons other
than process creation and destruction.
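
The reporting rule can be sketched as follows. This is a simplified single-process illustration, not the Spice file server code: the classes Parent and Child and their method names are hypothetical, and the conformal transaction is stood in for by an ordinary method call that is merely counted separately from plain messages.

```python
MAX_PORTS = 20   # Unix limit on open files per process, as stated in the text


class Parent:
    def __init__(self):
        self.recorded_ports = 0   # plays the role of P1.n
        self.messages = 0         # best-effort port-count messages received
        self.transactions = 0     # stand-ins for conformal transactions

    def notify(self, count):
        self.recorded_ports = count
        self.messages += 1

    def conformal_update(self, count):
        self.recorded_ports = count
        self.transactions += 1


class Child:
    def __init__(self, parent):
        self.parent = parent
        self.ports = 0            # plays the role of C1.n

    def _change(self, delta):
        old, new = self.ports, self.ports + delta
        self.ports = new
        if 0 in (old, new):               # entering or leaving zero: deterministic
            self.parent.conformal_update(new)
        elif MAX_PORTS in (old, new):     # entering or leaving twenty: plain message
            self.parent.notify(new)

    def open_file(self):
        if self.ports == MAX_PORTS:
            return False                  # request is returned to the parent
        self._change(+1)
        return True

    def close_file(self):
        if self.ports == 0:
            return False
        self._change(-1)
        return True
```

Opening twenty files in this sketch costs one conformal update (leaving zero) and one plain message (reaching twenty); the eighteen intermediate changes generate no traffic, so the parent's record is deliberately stale in between.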

By introducing probabilistic co-operation, the communication between the parent and the child for the
purpose of process creation and destruction is dramatically reduced. There is essentially one transaction
needed during the lifetime of a child process, independent of the number of files accessed by a user. That
transaction is the one that destroys a child process and alters its parent's record. Only in the rare instances
when some users need more than twenty outstanding open files are there additional message exchanges
among parents and their child processes. Actual implementation and testing has confirmed that this
formulation solves the synchronization problem with a significant improvement in performance (due to the reduced
message traffic and message processing time in parent processes).

This example demonstrates that in a message based system the cost of keeping state variables consistent all
the time could be high, even on a uni-processor. We believe that in a distributed system the cost of keeping
distributed state variables consistent is much higher. Therefore, it is worthwhile to have mechanisms, such as
distributed co-operating processes, that permit the separation of the critical parts of relationships that need to
be preserved deterministically from the non-critical parts that can be preserved probabilistically.

3.4.2  Co-operating  Transactions 

3.4.2.1 A New Concept

Atomic transactions are vital to distributed database systems, because they allow the consistency constraints
of distributed data objects to be preserved despite the failure of individual pieces of the system. A
decentralized global operating system requires the same kind of failure atomicity, and so must be constructed
with a transaction facility in its kernel [Jensen 80].

Unfortunately, the serialization model developed for distributed database systems places a fundamental
limitation on the use of transactions: i.e., they can model only sequential actions or concurrent actions that are
logically equivalent to sequential actions. Yet, a significant part of operating system software takes the form
of co-operating processes. The two-way communications among co-operating processes make it impossible to
transform co-operating processes into co-operating transactions without violating the relative ordering
requirement of the serialization model. One of the achievements of our relational model of data consistency is
that it provides a foundation for formulating co-operating transactions.

From an application point of view, the need for co-operating transactions arises from the desire to make the
actions of co-operating processes atomic. For example, consider the hypothetical case of loan activities within
a group of independent banks whose computers are connected by a network. Normally, a bank would handle
loan applications by itself; however, if an acceptable loan requires more than 10% of the bank's current
capital, the bank must (because of government regulations) ask other banks to syndicate the loan. We can
model this as a set of co-operating processes, each of which encapsulates its own confidential financial
database. Normally, a process operates in the autonomous phase to handle loan applications by itself. The
co-operation starts when a process is asked to join the loan syndication. Once asked, a server will examine its
own loan portfolio to determine whether it should accept, refuse, or try to negotiate the terms. Although the
formulation of co-operating processes models the loan activity well (i.e., a group of independent processes
that sometimes co-operate), it has a reliability problem. When a computer involved in a syndicated loan
crashes, the financial database containing the banking accounts involved in the loan activities might be in an
inconsistent state. This is not acceptable, and these process interactions must be made atomic to help
eliminate this problem.

Co-operating transactions are transactions that communicate with each other and satisfy the conformity
condition. There are two types of data objects manipulated by co-operating transactions. The first is the state
variables of co-operating transactions. As with co-operating processes, the partial dependency relations
among state variables define the co-operation. The operands of the co-operating transactions are the second
type of data object. The manipulation of operands represents the external effects visible to the users. Since
operands are organized in the form of disjoint atomic data sets, co-operating transactions can be structured in
the form of nested transactions. Each of the sub-transactions of a co-operating transaction operates on one or
more atomic data sets and satisfies the conformity condition.

Now we turn to the subject of managing the commit process of a co-operating transaction. A
sub-transaction can be committed if and only if the action invariants of both the sub-transaction and all the levels
of the co-operating transactions are satisfied. Therefore, an invoked sub-transaction can perform only the first
phase of a two phase commit protocol and must leave the final decision of whether to complete or abort the
commit to the co-operating transaction. For example, suppose that in the loan syndication problem, bank A
originates the loan syndication request, and bank B agrees to participate. The sub-transactions invoked in A
and B for handling that loan, such as transferring M1 dollars from B to A, and transferring the total
amount of M2 dollars to the customer, must all be done in order to conclude the loan. When all the
sub-transactions invoked by A and B have completed their first phase commit, A (the originator of the
syndicate) will follow a distributed two phase commit protocol [Bernstein 80] to conclude the loan
syndication.
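
The division of labor in the commit process can be sketched as follows. This is a generic two phase commit outline under simplifying assumptions (no failures, no timeouts, no logging), not the protocol of [Bernstein 80]; the class SubTransaction and the function conclude are hypothetical names.

```python
class SubTransaction:
    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare   # whether its own action invariant holds
        self.state = "active"

    def prepare(self):
        """First phase only: vote, but leave the final decision to the originator."""
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.state == "prepared"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def conclude(sub_transactions):
    """Run by the originating co-operating transaction (bank A in the example)."""
    votes = [sub.prepare() for sub in sub_transactions]   # phase one: collect votes
    if all(votes):
        for sub in sub_transactions:                      # phase two: commit all
            sub.commit()
        return True
    for sub in sub_transactions:                          # phase two: abort all
        sub.abort()
    return False
```

If either the transfer of M1 dollars or the payment of M2 dollars cannot be prepared, conclude aborts both, so the loan is never left half done.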

We  would  like  to  make  two  comments  on  this  example.  First,  the  reliability  problem  per  se  can  also  be 
solved  by  viewing  the  financial  records  of  each  bank  as  a  shared  database  and  using  conventional  serializable 
transactions.  However,  in  a  typical  database  approach  such  as  in  [Bernstein  80],  once  an  external  transaction 
obtains  the  write  lock,  the  database  is  directly  manipulated  by  the  transactions.  In  our  approach,  external 
transactions  can  only  indirectly  manipulate  another  bank's  financial  database  via  requests  to  the  active  local 
server.  It  is  often  important  to  restrict  external  users  from  direct  access  to  another  user's  (or  system)  data  in 
order  to  provide  some  degree  of  system  security.  Secondly,  co-operating  transactions  also  provide  better 
concurrency due to the fact that non-serializable concurrent actions are permitted.


3.4.2.2 Example: Graceful Degradation

This example arose from the need to provide a reliable authentication service in the Accent network
operating system. Since the database managed by the authentication servers is vital to the integrity of the
entire system, it is required that the loss of an individual system element result only in the loss of some
performance. Our approach to solving this problem is to use co-operating transactions. The three basic issues
in defining co-operating transactions are: 1) the operand atomic data sets; 2) the partial dependency relations
among co-operating servers; 3) the definition of sub-transactions. In this case, there are two types of operand
atomic data sets. The first is a capability list of users organized as access group lists. The second is records of
users' registered ports, which identify processes as having the access rights of their users. The user capability
list is partitioned to improve the concurrency of access. For reliability reasons, each part of the capability
list and the record of a user's registered ports are replicated and distributed on two physically independent
machines.

The system authentication servers are organized into a mutual back-up ring. Suppose that there are three
servers, S1, S2 and S3, residing on machines one, two and three, respectively. Let the partitioned and
duplicated capability lists be {L1,1, L1,2}, {L2,2, L2,3} and {L3,1, L3,3}, where the first subscript corresponds to
the server who is responsible for the set of the two copies of a partitioned list, and the second refers to the
location of the host machine. For example, the set {L2,2, L2,3} resides on machines two and three, and is
maintained by server S2. A server also has the capability to manipulate the portion of the atomic data sets
that resides on its machine, so that it can take over the task of a failed server. For example, server two, in
addition to maintaining the set {L2,2, L2,3}, also takes care of L1,2 should server one crash. In addition to the
management of the capability list, a server also maintains the records of registered ports. These records are
managed in the same way as the capability lists.
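
The replica layout of such a ring can be sketched as below. The placement rule is an assumption made for illustration: each server's list is kept on its own machine and on the next machine around the ring. The function names placement and surviving_copies are hypothetical.

```python
MACHINES = [1, 2, 3]


def placement(machines):
    """Map each responsible server to the two machines that hold its list.

    Assumed rule: the server's own machine plus the next machine in the ring.
    """
    n = len(machines)
    return {m: (m, machines[(i + 1) % n]) for i, m in enumerate(machines)}


def surviving_copies(failed, machines=MACHINES):
    """For each list, return the machines that still hold a copy after one failure."""
    return {owner: [m for m in hosts if m != failed]
            for owner, hosts in placement(machines).items()}
```

Under this rule, every list keeps at least one copy after any single machine failure, which is the property that lets the neighbors of a failed server take over its work.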

The partial dependency relation among servers is as follows. Normally, servers are working independently.
Each of them maintains the atomic data sets for which it is responsible. Co-operation among servers is
triggered by the events representing the failure or recovery of a server. In Accent, the interprocess
communication sub-system automatically monitors, and polls if necessary, each process. Once the failure of a
process is detected, the interprocess communication facility will inform the relevant parties. The neighbors of
a failed server will co-operatively close the mutual back-up ring. For example, if S2 crashes, S3 will recover
the atomic data set (such as by getting copies from S1). Furthermore, S1 will ask S3 to recreate lost
redundant files (such as L1,3) on machine three. The co-operation associated with the closing of the ring
completes when all the relevant atomic data sets are reconstructed. From that point on, S1 (or S3) will then
manage the atomic data sets that were managed by S2. When a server process recovers, it will inform its
neighbors to transfer the updated atomic data sets back to it. When all the file transfers are done, the
recovered server resumes its duty.


The definition of the sub-transactions for this example is straightforward. A sub-transaction is needed to
manage the capability list, another is needed to manage records of registered ports, and a final one is needed
to perform file management. The first two sub-transactions are used in normal operations, while the file
management sub-transactions are used in reconstructing the atomic data sets during the failure and recovery
procedures of a server. The action invariant at the server level (i.e., the co-operating transaction level) is
simply that all invoked sub-transactions for a task must all be done. For example, when a recovered server is
inserted back into the ring, there are two file transfer sub-transactions transferring files back to the recovered
one from its two neighbors, which must all be completed in order to conclude the insertion.

3.4.2.3 Example: Distributed Load Leveling

This example was conceived to illustrate the communications involved in, and the probabilistic behavior of,
co-operating transactions. In this example we examine the problem of distributed load leveling for a
point-to-point computer network. In any load leveling scheme, there are two major problems that must be addressed
-- the first is providing atomic transfer of work items between work queues, and the second is ensuring the
stability of the load leveling operation. The atomicity requirement arises from the need to guarantee that
work items will not be lost or duplicated should a node crash during an instance of load leveling. Instability
may result from the lack of co-operation among load leveling activities. For example, suppose a pair of heavily
loaded nodes (nodes A and B) share a common, lightly loaded neighbor (node C). Nodes A and B might
simultaneously observe that node C is lightly loaded, and attempt to off-load some of their work onto it. This
would result in node C becoming heavily loaded, and it may then choose to redistribute its load with nodes A
and B. This could clearly result in a pathological condition in which work items are repeatedly redistributed.
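
The overshoot in this scenario can be reproduced with a small sketch. The policy used here (each node ships half of the load difference to its lightest neighbor, acting on the same stale snapshot) is a hypothetical one chosen for illustration, and uncoordinated_round is not a function from the report.

```python
def uncoordinated_round(loads, neighbors):
    """One round in which every node decides from the same stale snapshot of loads."""
    snapshot = dict(loads)      # what each node believes during this round
    new_loads = dict(loads)
    for node, adjacent in neighbors.items():
        target = min(adjacent, key=lambda n: snapshot[n])   # lightest known neighbor
        surplus = snapshot[node] - snapshot[target]
        if surplus > 1:
            moved = surplus // 2    # assumed policy: ship half the difference
            new_loads[node] -= moved
            new_loads[target] += moved
    return new_loads
```

Starting from loads of 10, 10 and 0 on nodes A, B and C, one round leaves A and B with 5 each and C with 10: because A and B acted without co-operating, C has become the most heavily loaded node and now has reason to push work back.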

Thus,  for  distributed  load  leveling,  it  is  necessary  to  have  both  atomicity  of  work  item  transfers,  and  a  form 
of demand-driven co-operation that is able to adapt to a changing environment. The co-operating transaction
paradigm is a formalism that provides a method of meeting these requirements, while permitting highly
concurrent execution of the nodes' load leveling functions. The demand-driven, adaptive co-operation
between transactions may be represented by probabilistic relations among the state variables of the transactions.
The  co-operating  transaction  responsible  for  load  leveling  at  each  node  typically  operates  in  an  autonomous 
fashion  managing  the  node’s  work  queue  and  exchanging  load  information  with  other  nodes.  At  some  point 
in  time,  a  node  may  decide  that  it  is  in  the  best  interest  of  the  system  to  engage  in  an  instance  of  load  leveling. 
A  node  would  then  attempt  to  enter  into  a  co-operative  state  with  some  of  its  nearest  neighbors.  This  phase  of 
the load leveling function is probabilistic insofar as the neighboring nodes are not constrained to enter into a
co-operative  state  whenever  requested  to  do  so.  This  is  because  the  load  information  at  each  node  is  partial 
and  inaccurate.  In  the  event  that  none  of  the  neighboring  nodes  agree  to  enter  into  co-operation  with  the 
requesting  node,  the  request  must  be  withdrawn  and  (possibly)  reattempted  at  a  later  point  in  time.  On  the 
other hand, should a node be successful in entering a co-operative state with one or more of its neighbors, the
group of co-operating nodes collectively enter into a negotiation phase in which it is determined how the load
associated with the group should be distributed in order to best accomplish load leveling. It should be noted
that the group of nodes involved in co-operation with the node initiating the load leveling attempt could
extend  beyond  its  nearest  neighbors  if  a  non-neighboring  node  simultaneously  entered  into  a  co-operative 
state  with  a  common  neighboring  node. 

In general, nodes in the co-operating group will carry out decisions that result from the negotiation within
the  group.  However,  due  to  the  dynamic  nature  of  local  work  item  generation  and  consumption,  a  node's 
load  could  be  substantially  different  at  the  time  a  load  transfer  is  attempted  from  when  the  group  plan  was 
devised.  It  is  therefore  desirable  for  the  system  to  permit  local  adjustment  to  the  group  plan  whenever  the 
situation  warrants.  Allowing  local  adjustment  is  another  example  of  the  probabilistic  co-operation  in  this 
example,  in  that  there  is  no  absolute  guarantee  that  the  original  load  leveling  scheme  will  be  carried  out  as 
planned.  For  example,  the  original  group  plan  might  require  that  node  A  transfer  ten  work  items  to  node  B. 
However,  before  the  transfer  is  complete,  node  B  receives  a  block  of  locally  generated  work  items.  In  this 
situation it may be subsequently determined that the interests of the system are best served by transferring
only  five  of  the  ten  work  items.  An  advantage  of  using  co-operating  transactions  in  such  a  case  is  that  they 
permit  co-operation  (communication)  during  the  execution  of  the  transactions,  and  thus  are  able  to  adapt  to 
environments  that  change  quickly  with  respect  to  their  execution. 

In  the  co-operating  transaction  formulation,  each  node’s  work  queue  represents  an  operand  atomic  data  set 
which  is  encapsulated  by  the  co-operating  transaction  that  implements  a  node’s  load  leveling  function.  The 
basic sub-transactions involved in the manipulation of nodes' work queues are ADD, DELETE, and TRANSFER.
The ADD and DELETE sub-transactions are used to atomically insert and delete items from local work queues.
The TRANSFER sub-transaction carries out specified transfers of work items by invoking the destination node's
ADD sub-transaction, sending the work items, and invoking the source node's DELETE sub-transaction. The
action invariant of the TRANSFER sub-transaction is that both the remove and insert operations must be
successfully completed. Since job queues are encapsulated locally, when the sender's TRANSFER sub-transaction
attempts to invoke the receiver's ADD sub-transaction, the receiver may modify the parameters of the ADD
sub-transaction. In the above example, ten jobs are sent from node A to node B; however, node B takes five
work items, instead of ten, and informs node A accordingly. Although node A can reject B's modification by
aborting the transaction, A may well re-execute the local ADD sub-transaction and commit the modified
transfer. This is clearly more efficient than blindly carrying out the original plan and having to remedy it
later. 
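
The interaction among the three sub-transactions can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: the `WorkQueue` class, its `capacity` limit, the `undo_add` helper, and the `accept_modification` callback are all hypothetical names introduced here, and the sketch ignores node crashes and recovery entirely. It shows only the action invariant (ADD and DELETE both happen, or neither does) and the receiver's freedom to modify the ADD parameters.

```python
class WorkQueue:
    """A node's work queue, reachable only through its own sub-transactions."""

    def __init__(self, name, items, capacity):
        self.name = name
        self.items = list(items)
        self.capacity = capacity  # hypothetical local limit on queue length

    def add(self, offered):
        """ADD: accept as many offered items as local state allows."""
        room = max(0, self.capacity - len(self.items))
        accepted = list(offered[:room])
        self.items.extend(accepted)
        return len(accepted)  # may be fewer than were offered

    def undo_add(self, count):
        """Roll back the most recent ADD of `count` items."""
        if count > 0:
            del self.items[-count:]

    def delete(self, count):
        """DELETE: remove the first `count` items from the local queue."""
        del self.items[:count]


def transfer(source, dest, count, accept_modification=lambda planned, actual: True):
    """TRANSFER: both ADD and DELETE complete, or neither does."""
    offered = source.items[:count]
    accepted = dest.add(offered)
    if accepted != len(offered) and not accept_modification(len(offered), accepted):
        dest.undo_add(accepted)  # source rejects the modification: abort
        return 0
    source.delete(accepted)      # commit the (possibly modified) transfer
    return accepted


node_a = WorkQueue("A", range(12), capacity=20)
node_b = WorkQueue("B", ["x", "y", "z"], capacity=8)
moved = transfer(node_a, node_b, 10)
print(moved, len(node_a.items), len(node_b.items))  # 5 7 8
```

Here the plan called for ten items, node B had room for only five, and node A accepted the modified transfer. Passing an `accept_modification` callback that returns `False` models node A aborting instead, in which case both queues are left unchanged.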

Finally,  it  should  be  noted  that  the  atomicity  of  work  item  transfers  per  se  can  be  solved  by  using  a  typical 
database approach based on the serialization model. However, this would be done at the cost of concurrency,
protection, and performance. The loss of concurrency is due to the relative ordering requirement imposed by
the serialization model, which is unnecessary for work item transfers. In the co-operating transaction
formulation, the relationships between any two work queues are autonomous. The integrity of work item
transfers is represented by the action invariant "both the destination node's ADD sub-transaction and the
source node's DELETE sub-transaction must be done or neither is done". The work item transfer
sub-transactions can be done in any relative order, and may or may not be serializable. The degree of
protection is reduced under the database approach because there one's own work queue may be arbitrarily
manipulated by any transaction that obtains a write lock, whereas with co-operating transactions each work
queue is encapsulated by a local load leveling transaction which controls access to, and maintains the
consistency of, the queue. Finally, performance is sacrificed with serializable transactions because they are
not able to adapt to a changing environment, as can co-operating transactions, which may communicate in the
course of their operation.

3.4.3  Conclusion 

Our initial experiments with applying these ideas to distributed operating systems have been very
encouraging. We believe they are valuable in network operating systems but essential in a decentralized
operating system such as ArchOS [Jensen 82] will be. The kinds of interaction amenable to our approach to
co-operating processes and transactions are not yet delineated. Neither is it yet very clear what all the
implications of these concepts could be on suitable operating system structures. Our research and experiments
are continuing, and will be reported in the literature.

4. Interprocess Communication


4.1  Overview 

Interprocess  communication  (IPC)  is  vital  to  performing  decentralized  computations.  We  have  not  taken 
the  usual  approach  in  distributed  systems  of  simply  designing  a  facility  for  IPC.  Our  research  objectives 
demand  that,  as  our  understanding  of  decentralized  computations  and  operating  systems  grows,  we  must  be 
able to change the IPC facility quickly and easily to provide appropriate support. We are pursuing the use of
a technique called "policy/mechanism" separation in the design and implementation of IPC facilities.

Briefly, a policy is defined as a specification of the manner in which a set of resources is managed, and a
mechanism is defined as the means by which policies are carried out [Brinch Hansen 70]. Policy/mechanism
separation is a structuring methodology that segregates policies that dictate resource management strategies
from mechanisms that implement the lower-level tactics of resource management. Policy/mechanism
separation can be applied to a system constructed in a layered fashion; the facility provided at a given level
may be implemented by a policy in terms of mechanisms, and that facility may in turn be used to create
mechanisms at the next higher level.

The design and implementation of IPC facilities are an important part of multiprogramming systems in
general, and are critical to "distributed systems". Furthermore, because IPC facilities have great impact on the
systems  of  which  they  are  a  part,  serious  thought  must  be  put  into  their  functionality  and  structure. 

Policy/mechanism  separation  has  been  shown  to  be  valuable  in  the  design  of  general  operating  system 
facilities  [Brinch  Hansen  70,  Kahn  81,  Wulf  74],  but  primary  emphasis  has  been  on  the  area  of  process 
scheduling  [Bernstein  71,  Levin  75].  Furthermore,  until  now  there  have  been  no  explicit  attempts  at  applying 
these principles specifically to IPC facilities. There is reason to believe that policy/mechanism separation is
likely  to  prove  useful  in  achieving  a  number  of  goals  for  IPC  facilities,  such  as: 

•  the  flexibility  to  create  a  wide  range  of  different  facilities, 

• support for multiple, different, coexistent IPC facilities, and

• a viable approach to providing hardware support for IPC.

This  research  will  result  in  a  set  of  IPC  mechanisms  that  will  support  the  implementation  of  a  wide  range  of 
IPC  facilities.  Another  contribution  will  consist  of  an  evaluation  of  the  policy/mechanism  approach  to  IPC, 
based on implementations of the previously specified mechanisms and a chosen set of IPC policies. Further
contributions of this research will include a taxonomy of the IPC design space, a logical framework to
represent various implementations of a range of IPC facilities, an evaluation of the degree to which multiple
IPC facilities can be simultaneously supported, and an assessment of whether the set of proposed IPC
mechanisms can be effectively supported with hardware (which would include descriptions of proposed
hardware mechanisms).

Although there has been a great deal of work in the general area of IPC [Northcutt 83], relatively little of
that work is strongly related to the research outlined in this document. Of the many different types of IPC
facilities that exist or have been proposed, few have had flexibility (in the sense of permitting a range of
different facilities) as a goal, although some contend that their system is capable of implementing a wide range
of IPC facilities [Rao 80]. Furthermore, while others have attempted to provide hardware support for their
particular IPC facility [Cox 81, Ford 77, Giloi 81, Spier 73], such support tends to be unsubstantial and highly
inflexible.  To  the  best  of  our  knowledge,  there  are  no  instances  of  IPC  facilities  explicitly  designed  and 
implemented  according  to  the  principles  of  policy/mechanism  separation.  This  is  despite  the  fact  that  some 
IPC  facilities  consist  of  operations  known  as  "primitives"  [Liskov  79]. 

4.2  The  Separation  of  Policy  and  Mechanism  in  IPC 

4.2.1  Introduction 

This research explores the separation of policy and mechanism in the design and implementation of
interprocess communication (IPC) facilities. Briefly, a policy is defined as a specification of the manner in
which a set of resources is to be managed, and a mechanism is defined as the means by which policies are
carried out [Brinch Hansen 70]. Policy/mechanism separation is a structuring methodology that segregates
policies that dictate resource management strategies from mechanisms that implement the low-level tactics of
resource management. This technique has been suggested for, and applied to, the design and implementation of
general operating system facilities [Levin 75]. Policy/mechanism separation can be applied to a system
constructed in a layered fashion; the facility provided at a given level may be implemented by a policy in
terms of mechanisms, and that facility may in turn be used to construct mechanisms for a facility at the next
higher level.

The design and implementation of IPC facilities are an important part of multiprogramming systems in
general and are critical to distributed systems¹. In systems whose software is constructed as a collection of
conceptually distinct programming elements (e.g., processes), an IPC facility is the fundamental means by
which the components of the system communicate with one another (and in some cases with other system
facilities). In general, IPC facilities have a great impact on the nature of the systems they are a part of; a
great deal of effort should therefore be put into the design and implementation of these facilities. Not only
does the logical functionality of IPC facilities affect the structure and behavior of the systems, but the nature
of the systems themselves affects the requirements for their IPC facilities. Furthermore, both the degree to
which a system places demands on an IPC facility and the response time constraints of a system influence the
efficiency requirements of a system's IPC facility.

¹ We use this term in the popular sense, i.e., meaning any system with more than one processor.

Policy/mechanism separation has been shown to be valuable in the design of general operating system
facilities [Brinch Hansen 70, Cox 81, Wulf 74]. At this time, however, there have been no explicit attempts at
applying these principles specifically to IPC facilities; the primary emphasis has been in the area of process
scheduling. Nonetheless, there is reason to believe that a policy/mechanism separation approach could prove
useful in achieving a number of goals for IPC facilities.

The benefits of an IPC facility based on policy/mechanism separation can be expected to include the
following:

• providing an IPC facility flexible enough to permit the creation of a wide range of different IPC
facilities through the application of various policies to IPC mechanisms;

• supporting multiple, different "native" IPC facilities that can simultaneously coexist at the same
level in a given system; and

•  an  approach  to  providing  hardware  support  for  IPC  facilities,  to  improve  performance  without 
sacrificing  flexibility. 

Separating  policy  from  mechanism  yields  primitive  functions  (mechanisms)  with  which  various  IPC 
facilities can be implemented by changing policies (i.e., many specialized IPC facilities can be implemented in
terms of a single set of mechanisms). Policy/mechanism separation thus results in highly flexible IPC
facilities. This property is particularly useful for a testbed system, designed for experimentation with various
(possibly  unforeseen)  operating  system  structures. 
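
As a rough illustration of how one set of mechanisms can yield different IPC facilities under different policies, consider the sketch below. It is not a description of any Archons mechanism: the `Mailbox` primitives, the two selection policies, and the `IpcFacility` wrapper are hypothetical names chosen for this example, and messages are assumed to be `(priority, body)` pairs.

```python
from collections import deque


class Mailbox:
    """Mechanism: primitive storage operations with no delivery strategy."""

    def __init__(self):
        self._slots = deque()

    def put(self, message):
        self._slots.append(message)

    def pending(self):
        return list(self._slots)

    def take(self, index):
        message = self._slots[index]
        del self._slots[index]
        return message


def fifo_policy(pending):
    """Policy: always deliver the oldest message."""
    return 0


def priority_policy(pending):
    """Policy: deliver the most urgent message (lowest priority number)."""
    return min(range(len(pending)), key=lambda i: pending[i][0])


class IpcFacility:
    """A facility is a policy applied to a collection of mechanisms."""

    def __init__(self, select):
        self._mailbox = Mailbox()
        self._select = select

    def send(self, priority, body):
        self._mailbox.put((priority, body))

    def receive(self):
        pending = self._mailbox.pending()
        if not pending:
            return None
        return self._mailbox.take(self._select(pending))[1]


fifo = IpcFacility(fifo_policy)
urgent_first = IpcFacility(priority_policy)
for facility in (fifo, urgent_first):
    facility.send(2, "routine")
    facility.send(1, "urgent")
print(fifo.receive(), urgent_first.receive())  # routine urgent
```

Both facilities share the `Mailbox` code unchanged; only the policy function differs, which is the kind of flexibility a testbed system would want.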

It  should  be  noted  that  the  separation  of  policy  and  mechanism  in  the  design  and  implementation  of  IPC 
facilities  is  not  necessarily  being  suggested  as  a  general  approach  to  the  construction  of  IPC  facilities.  Rather, 
IPC facilities designed according to the policy/mechanism separation approach offer certain (unique)
benefits, and lend themselves best to certain specific environments. It may be the case, however, that this
approach to IPC facility design is sufficiently broad in its applicability that it might be used more generally.
Such an occurrence would be similar to cases where writeable control store was provided in a prototype
computer architecture for development purposes, but proved to be so useful that it was included in later
production versions of the computer.


4.2.2  Background 

Before we can further discuss the separation of policy and mechanism in IPC, it is important that the notion
of IPC be somewhat better defined, and our scope of interest in IPC more clearly delineated. In addition, the
terms  policy  and  mechanism  must  be  defined  along  with  the  general  concept  of  policy/mechanism  separation. 

4.2.2.1 Definition of Interprocess Communication

A  common  method  of  structuring  a  programming  system  is  to  construct  it  from  a  (possibly  hierarchically 
structured)  collection  of  program  entities.  These  entities  are  commonly  known  as  processes,  an  operating 
system  supported  abstraction  that  can  be  thought  of  as  the  basic  unit  of  computation  and  concurrency  in 
modular  programming  systems  [Habermann  76].  For  the  purposes  of  this  discussion,  we  consider  a  process  to 
be  a  unit  of  computation  that  is  serially  executed  on  an  underlying  (real  or  virtual)  machine,  in  (real  or 
virtual)  asynchronous  concurrency  with  respect  to  other  processes.  Despite  the  abundance  of  more  formal 
definitions  of  processes,  there  is  not  one  commonly  agreed  upon  nor  more  appropriate  for  our  immediate 
needs. 

To  cooperate,  processes  require  some  means  of  communication.  This  communication  can  take  many  forms, 
but IPC is the activity of deliberately and explicitly exchanging information among processes. In the case
where the processes wishing to communicate have intersecting domains, IPC can be carried out by one
process instantiating the information to be exchanged in a shared portion of the processes' domains;
communication then largely consists of coordinating access to the shared information. Where process domains are
disjoint, IPC is performed by moving information from the domain of one process to the domain(s) of one or
more other processes. Throughout this research we will consider only the latter case of IPC, and
communication among processes by such means as shared memory, common files, etc. is not included here.

At  the  next  lower  level  of  detail,  IPC  can  be  roughly  thought  of  as  being  composed  of  four  basic  activities: 

1. the specification of the participants involved in a given instance of IPC (i.e., which processes are
involved and in what capacities, e.g., message source, message destination, etc.);

2. the instantiation of information, initially local to one process, in the domain of one or more other
processes (i.e., what information is to be exchanged and how the exchange is to be carried out, e.g.,
reliably, sequenced, broadcast, multicast, etc.);

3. the act of causing, detecting, or being made aware of, various events involved in the coordination
of communication activities (i.e., how the participants are able to create and detect events such as
"message  queued  at  destination  process",  "message  accepted  by  communication  subsystem",  etc.); 

4.  the  interpretation  of  (at  least  portions  of)  the  information  instantiated  in  a  process’  domain  as  a 
result  of  an  act  of  IPC  (i.e.,  to  what  extent  the  entire  message  is  to  be  decoded  by  the  processes 
and  the  system). 


Each of these fundamental activities must be performed explicitly or implicitly in an instance of IPC, and an
IPC facility must provide all of the functions to do so. Furthermore, there exist a great many ways in which
these  basic  activities  can  be  provided  to  a  user  process,  and  different  IPC  facilities  provide  them  in  different 
forms. 

4.2.2.1.1 The Role of IPC in Programming Systems

A  common  form  of  structured  system  design  and  implementation  is  known  as  layering  [Dijkstra  68].  A 
layered  system  creates  successively  higher  levels  of  functionality  by  implementing  each  layer  in  terms  of  the 
underlying layers². In such a system, the peer processes at each layer require a form of IPC (known as
protocols [Zimmermann 80]) among themselves. The IPC facilities used by each layer could be the identical,
basic IPC service, or each layer could make use of a different IPC facility with increasing functionality. For
example, in the RIG system a fundamental form of IPC is used to provide access to more elaborate
forms [Lantz 80]. Thus the IPC facility in a system could also be layered: each layer of IPC could be
increasingly rich in functionality, and the IPC service at each layer could be well suited to the type of
communication  that  occurs  among  processes  at  that  specific  level. 

In  a  layered  system,  a  form  of  communication  is  required  for  higher  layers  to  invoke  functions  provided  by 
lower layers. This is usually known as an interface, i.e., communication among processes in different
layers [Zimmermann 80]. While operating system processes at layers above an IPC facility can use that facility
for both interface and protocol communication, this is not true for processes using the lowest-level (i.e., the
fundamental) IPC facility. These processes require some other form of communication (which is not IPC by
our definition) to interface to this facility, due to the obvious circularity of needing to use a facility in order to
access (or provide) that same facility. There exist a number of different means through which processes can
access fundamental IPC facilities, including procedure calls, language constructs, supervisor call-type
instructions, etc.

4.2.2.1.2 The IPC Facility Design Space

In  addition  to  the  wide  variety  of  options  that  exist  for  the  implementation  of  IPC  in  a  system,  there  are 
many  different  types  and  degrees  of  functionality  that  can  be  provided  by  an  IPC  facility.  IPC  can  range  from 
a  simple  device-like  form  to  a  complex,  transaction-oriented  facility.  To  illustrate  the  possible  differences  in 
IPC facility design, a number of options are given below. Note that there is no attempt to suggest that these
are the most common variations, nor is anything to be inferred from the order in which they appear.
Furthermore, it should not be assumed that all of these options are compatible with one another.

² In a strictly layered system, each layer is defined exclusively in terms of the objects (i.e., data structures
and operations on them) provided by the layer immediately below it.

• An instance of IPC is usually initiated by one of the processes involved in the communication: the
process that generated the information to be transferred (i.e., source initiated); the intended
recipient of the information transfer (i.e., destination initiated); or alternatively a process that is
neither source nor destination of the transfer (i.e., third party initiated) [Jensen 78a].

•  It  is  necessary  that  the  parties  involved  in  a  given  act  of  communication  be  specified.  This  can 
occur by addressing a message with the name(s) of the destination process (i.e., destination
addressing), the name of the source process (i.e., source addressing), or by defining a special "tag"
field (i.e., content addressing). In addition to these addressing methods, it is possible to perform
implicit addressing by the use of connections, or by using a logical service address. (Note that one
or more of these pieces of information could be included within a message, but in this discussion
we are referring only to the information used in performing the act of addressing.)

• The actual exchange of units of information (e.g., messages) can take place in many different ways.
Much like parameters passed in procedure calls, messages could be passed by value, reference, or
function. Also, the relationship between sources and destinations of messages could be one-to-one
(i.e., two-party), one-to-many (i.e., multicast), one-to-all (i.e., broadcast), or all-to-one (i.e.,
promiscuous). The behavior of messages with respect to their receive semantics could be
once-and-only-once, at-least-once, or something different. Message transfers could even be guaranteed
to be atomic with respect to other message transfers (i.e., a transaction).

•  Control  information  passed  to  the  client  of  an  IPC  facility  from  the  facility  can  be  described  as 
being  either  imperative  or  interrogative.  In  the  imperative  form,  control  information  is  made 
available without explicit action on the part of the client (e.g., an interrupt, the unblocking of a
process,  etc.).  The  interrogative  form  of  control  information  transfer  requires  that  the  client  issue 
a  form  of  "query"  operation  to  obtain  the  control  information. 

The type of information that may be passed in these ways includes the status of ongoing
communications (e.g., "message accepted by the local communication subsystem", "message accepted
by destination process(es)", etc.), or the state of the communication subsystem (e.g., "path I
operational", "N messages of type T queued for process P", etc.).

•  The  processes  involved  in  an  instance  of  IPC  can  have  a  number  of  different  relationships  among 
themselves,  with  respect  to  their  control  flow.  There  might  be  no  synchronization  between  the 
source and destination processes (i.e., asynchronous communication), either the source or the
destination process could suspend execution until the other has executed a send or receive (i.e.,
semi-synchronous communication), or both the source and destination processes could suspend
until the other issues a matching send or receive command (i.e., synchronous communication).

•  There  must  be  some  degree  of  agreement  on  the  format  of  messages  in  order  to  ensure  that 
processes  can  interpret  the  information  exchanged.  This  implies  that  the  format  of  messages  must 
either  be  entirely  fixed,  or  at  least  a  portion  that  describes  the  remainder  of  the  message  must  be 
fixed.  Furthermore,  if  the  communication  subsystem  must  interpret  the  contents  of  messages 
(e.g., to transform local capabilities into their remote manifestations), the message format must
accommodate  this  either  by  fixed  fields  or  special,  reserved  markers. 
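
One way to picture the design space sketched by the options above is as a product of independent choices. The enumeration below is only an illustrative encoding, not the taxonomy this research proposes: the class and member names are assumptions taken from the terms in the list, receive semantics and message format are left out for brevity, and, as noted earlier, not every combination need be compatible.

```python
import math
from dataclasses import dataclass
from enum import Enum

Initiation = Enum("Initiation", "SOURCE DESTINATION THIRD_PARTY")
Addressing = Enum("Addressing", "DESTINATION SOURCE CONTENT CONNECTION SERVICE")
Passing = Enum("Passing", "VALUE REFERENCE FUNCTION")
FanOut = Enum("FanOut", "TWO_PARTY MULTICAST BROADCAST PROMISCUOUS")
ControlInfo = Enum("ControlInfo", "IMPERATIVE INTERROGATIVE")
Synchronization = Enum("Synchronization", "ASYNCHRONOUS SEMI_SYNCHRONOUS SYNCHRONOUS")

DIMENSIONS = (Initiation, Addressing, Passing, FanOut, ControlInfo, Synchronization)


@dataclass(frozen=True)
class IpcDesignPoint:
    """One candidate IPC facility, described by a choice on each dimension."""
    initiation: Initiation
    addressing: Addressing
    passing: Passing
    fan_out: FanOut
    control_info: ControlInfo
    synchronization: Synchronization


def design_space_size():
    """Number of combinations before any incompatible ones are ruled out."""
    return math.prod(len(dimension) for dimension in DIMENSIONS)


# A hypothetical point in the space: a blocking, two-party, by-value facility.
example = IpcDesignPoint(
    Initiation.SOURCE, Addressing.DESTINATION, Passing.VALUE,
    FanOut.TWO_PARTY, ControlInfo.IMPERATIVE, Synchronization.SYNCHRONOUS,
)
print(design_space_size())  # 1080
```

Even with only these six dimensions the space contains over a thousand candidate facilities, which suggests why a single fixed IPC design is unlikely to suit every experiment.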

Different IPC facilities can coexist in a single system (e.g., in RIG there is Rashid's IPC along with a variety
of facilities based on Xerox protocols [Fleisch 81], and some versions of UNIX³ have both pipes and Rashid's
IPC [Rashid 80]). In addition to having a variety of IPC facilities at different layers in a system, it is possible
for the same functionality to be implemented at different layers. For the most part, the issues of functionality
and layering are independent with respect to IPC facilities; it is typically the case, however, that functionality
increases as IPC appears in higher layers in a system.

³ UNIX is a registered trademark of Bell Laboratories.

4.2.2.1.3 Our Scope of Interest in the Universe of IPC Facilities

Out of the universe of possible IPC facilities, we are confining our present interest to a specific subspace.
This  is  intended  to  restrict  the  emphasis  of  our  work  to  the  forms  of  IPC  that  we  consider  to  be  the  most 
appropriate  for  distributed  systems  in  general,  and  most  relevant  to  our  research  on  the  Archons  project  in 
particular [Jensen 83]. By restricting our scope of interest in the IPC design space, we are attempting to
reduce  the  number  of  IPC  facilities  we  must  consider  by  eliminating  those  facilities  which  (in  our  opinion) 
have  undesirable  or  uninteresting  characteristics. 

The  IPC  facilities  of  greatest  interest  to  us  in  this  research  share  the  following  characteristics: 

•  they  are  based  on  message  passing  (as  opposed  to  procedure  calls,  etc.); 

•  communication  is  primarily  via  an  explicit  IPC  facility  (not  shared  memory,  shared  files,  I/O, 
etc.); 

•  all  IPC  is  performed  with  the  explicit  consent  of  all  the  communicating  processes  (not  by  a 
unilateral  action  on  the  part  of  some  arbitrary  process). 

4.2.2.2 The Separation of Policy and Mechanism

The concept of policy/mechanism separation was described by Brinch Hansen in 1970 [Brinch Hansen 70]
and  applied  in  the  RC4000  system  [Brinch  Hansen  71].  Other  notable  systems  which  attempted  to  separate 
policy  and  mechanism  within  their  operating  systems  include  the  Hydra/C-mmp  system  [Wulf  74]  and  the 
iAPX 432/iMAX⁴ system [Kahn 81]. Experience has shown policy/mechanism separation to yield a number
of  benefits  in  the  design  and  implementation  of  systems. 

4.2.2.2.1 Policy/Mechanism Separation as a General Structuring Methodology

There are a number of concepts associated with the design and implementation of computer systems that
are (at least superficially) related to the notion of policy/mechanism separation, the most obvious of which is
abstraction. Policy/mechanism separation could be thought of as a form of abstraction, in that policies define
higher-level functions implemented in terms of lower-level ones (i.e., mechanisms). It is more useful,
however, to consider policy/mechanism separation as a technique for implementing a given layer of
functionality, which involves partitioning the layer into a part that dictates behavior and a part that carries it
out.

Another related concept is that of separating specification from implementation. In a sense, a policy is a
specification of a function and the mechanisms used to carry the policy out are its implementation. However,
this is best thought of as an issue that is orthogonal to that of policy/mechanism separation; the design and
implementation of the policy and mechanism portions of a facility could be performed by separating the
specification and implementation of each part.

Information hiding [Parnas 72] is a concept also related to policy/mechanism separation, inasmuch as
policies  are  implemented  in  terms  of  mechanisms  that  serve  to  isolate  the  policy  maker  from  the  details  of  the 
mechanisms'  implementations.  However,  the  primary  objective  of  information  hiding  is  to  insulate  the 
interface  of  a  facility  from  internal  changes  in  the  facility's  implementation.  This  is  as  opposed  to 
policy/mechanism  separation  which  attempts  to  insulate  the  internal  implementation  of  a  facility  from 
changes  in  its  external  interface. 

According  to  our  interpretation  of  this  concept,  we  now  define  some  terms  and  present  a  simple  view  of 
system  structure  based  on  the  separation  of  policy  and  mechanism. 

• Facility: a service characterized by a collection of operations that comprise its interface. A
facility implemented according to a policy/mechanism separation approach consists of a collection
of mechanisms and a policy which governs the manner in which the mechanisms' constituent
primitives are invoked.

•  Policy:  a  plan  of  action  relating  to  the  management  of  a  collection  of  resources,  based  on  "global" 
objectives,  general  goals,  and  acceptable  procedures.  In  facilities  implemented  according  to  a 
policy/mechanism  separation  approach,  policies  are  carried  out  by  the  invocation  of  primitives. 

• Mechanism: a related collection of functions that carry out various aspects of a common function.
Mechanisms  are  used  to  carry  out  policies  in  policy/mechanism  separation  implementations  of 
facilities. 

• Primitive: a function that carries out a single aspect of a particular function. Primitives are the
entities  which  are  invoked  in  order  to  carry  out  an  operation  on  behalf  of  a  higher-level  entity.  A 
mechanism  is  composed  of  primitives  that  perform  related  operations. 

The conceptual boundary between policy and mechanisms (for a given facility) might be viewed as the
separation between a pair of layers in a functionally layered structure. Also, it is clear that facilities that exist
at a certain level may support (or implement) mechanisms at higher layers in a system. However, all
discussion of policy/mechanism separation in this document should be assumed to be in the context of a
facility within a single layer of a system. Policy/mechanism separation is, in effect, a methodology that guides
the implementation of a given layer.

A  given  facility  is  implemented  by  making  use  of  mechanisms  according  to  a  given  policy.  The  same 
mechanism may be used in more than one facility, and a given facility could be implemented using the same
policy but different mechanisms. Also, separate facilities may be simultaneously provided by different
policies  implemented  in  terms  of  the  same  set  of  mechanisms.  However,  arbitrary  policies  may  not  be 
compatible,  and  as  such  may  not  be  capable  of  simultaneously  coexisting  in  a  system.  On  the  other  hand,  the 
choice  of  a  given  mechanism  tends  to  be  largely  independent  of  other  mechanisms.  The  choice  of 
mechanisms  can  affect  the  types  of  policies  that  can  be  carried  out  and  the  cost  of  carrying  out  the  policies. 

An  example  of  a  facility  at  the  operating  system  level  is  one  that  permits  the  multiplexing  of  a  physical 
processor. A scheduling facility could be implemented according to such policies as Round-Robin,
Shortest-Processing-Time-First, or Priority. Any of these policies might be implemented in terms of the same
processor multiplexing mechanism, which could consist of a set of operations such as: "define the selection
discipline", "select one of N processes", "stop currently active process", and "start process P". A scheduling
facility implemented with such mechanisms might exhibit the characteristics of policy/mechanism separation.
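
The scheduling example can be made concrete with a small sketch. The Python below is illustrative only and is not drawn from any system cited here: the class ProcessorMultiplexer, its method names, the Process fields, and the dispatch function are all hypothetical names chosen to mirror the four operations quoted above. The mechanism holds no opinion about which process should run; each policy is a selection discipline supplied from outside.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    priority: int         # assumed convention: larger value = more urgent
    processing_time: int  # assumed: estimated remaining time units


class ProcessorMultiplexer:
    """Mechanism (hypothetical): primitives only, no scheduling decisions."""

    def __init__(self):
        self.discipline = None
        self.active = None

    def define_selection_discipline(self, discipline):
        self.discipline = discipline

    def select_one_of_n(self, ready):
        return self.discipline(ready)

    def stop_active(self):
        stopped, self.active = self.active, None
        return stopped

    def start(self, process):
        self.active = process


# Policies: each one is just a different selection discipline.
def round_robin(ready):
    return ready[0]  # head of the queue; preempted processes rejoin at the tail


def shortest_processing_time_first(ready):
    return min(ready, key=lambda p: p.processing_time)


def priority(ready):
    return max(ready, key=lambda p: p.priority)


def dispatch(mux, ready):
    """Facility operation: carry out the current policy by invoking primitives."""
    previous = mux.stop_active()
    if previous is not None:
        ready.append(previous)
    chosen = mux.select_one_of_n(ready)
    ready.remove(chosen)
    mux.start(chosen)
    return chosen
```

Changing the scheduling facility from Round-Robin to Priority in this sketch is a single call to define_selection_discipline; none of the primitives change.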

4.2.2.2.2 The Separation of Policy and Mechanism in IPC

Despite  the  fact  that  the  separation  of  policy  and  mechanism  has  been  (more  or  less)  successfully  applied  to 
various  parts  of  a  number  of  systems,  IPC  has  not  yet  received  the  benefit  of  such  a  treatment.  To  this  point, 
the  primary  emphasis  on  applying  policy/mechanism  separation  has  been  in  the  area  of  process  scheduling 
and memory management [Levin 75]. Among the arguments for not applying policy/mechanism separation
to IPC facilities might be: due to its complexity, IPC is a facility which does not readily lend itself to such an
effort; it does not make sense to separate policy from mechanism in IPC, because it is such a low-level facility
that the benefits are outweighed by the costs; or there is nothing to be gained from an endeavor of this sort
that couldn't be better accomplished in some other fashion. Each of these objections will be shown to be
unreasonable in some cases. Furthermore, a number of counterarguments can be made which suggest there
is value in investigating policy/mechanism separation with respect to IPC.

4.2.2.3 Interprocess Communication for Decentralized Computer Systems

One  of  the  objectives  of  this  research  is  to  determine  the  degree  to  which  the  separation  of  policy  and 
mechanism  will  provide  an  effective  methodology  for  the  design  and  implementation  of  a  flexible  IPC  facility. 
This  is  of  special  interest  to  us  because  a  great  deal  of  flexibility  is  required  of  an  IPC  facility  for  the  support 
of  research  on  decentralized  operating  systems  (DOS’s)  [Jensen  83].  A  DOS  is  based  on  the  concept  of 
multilateral control and intended to reside on a physically dispersed computer (e.g., a local network-like
architecture),  thereby  forming  a  decentralized  computer  system  (DCS)  [Davies  81].  DOS  design  is  a  new  area 
of  very  active  research  and  there  is  very  little  practical  experience;  therefore  much  is  to  be  gained  from  design 
and  implementation  experiments.  The  most  obvious  approach  to  obtaining  empirical  data  on  DOS’s  is  to 
construct  a  DCS  testbed  on  which  prototype  DOS's  and  various  DOS  concepts  can  be  implemented  and 
evaluated.

In any DOS implementation based on the concept of multiprogramming (or cooperating concurrent
programs in general) there must be an IPC facility with which the constituent processes of the DOS
communicate. It is clear that the choice of an IPC facility can have a profound influence on the structure of the
programming  systems  that  make  use  of  that  facility.  Therefore,  the  IPC  facility  for  a  DOS  testbed  system 
should  support  a  range  of  software  structures  that  might  be  used  in  constructing  DOS's.  However,  the 
software  structures  most  appropriate  for  DOS’s  have,  as  yet,  not  been  conclusively  identified.  This  suggests 
that a good IPC facility for a DOS testbed system would be one that permits a wide range of different IPC
facilities  (and  hence  software  structures)  to  be  implemented  or  efficiently  emulated. 

As  a  result  of  research  on  the  fundamentals  of  DOS  design,  a  few  general  observations  can  be  made  on  the 
implications of DOS's on IPC facilities. It is clear at this point that master/slave type relationships will not be
the predominant form of process structure, but rather that general non-hierarchical process-process
relationships (e.g., collections of negotiating peers) will be most common. This implies that synchronous Send&Wait
or procedure call-oriented IPC facilities will be less appropriate for DOS's than facilities providing
message-based, N-party communication transactions. Furthermore, the cooperative nature of the collective
decision-making in DOS's suggests a greater amount of system-generated communication and makes a greater demand
on the efficiency of the IPC facility (and its underlying implementation) than does a typical local area network
operating system.

4.2.3  Rationale 

IPC  is  a  highly  important  operating  system  facility  that  greatly  influences  the  structure  and  performance  of 
systems.  A  number  of  different  approaches  exist  for  providing  an  appropriate  IPC  facility  for  a  given  system. 
Among  these  approaches  are:  a  facility  implemented  with  a  highly  parameterized  interface,  a  facility  with  a 
strictly  layered  implementation,  and  a  facility  that  employs  a  policy/mechanism  separation  approach.  The 
policy/mechanism separation approach holds a number of benefits that are not to be found in other
approaches. For example, separating policy and mechanism seems to be an exceptionally good method of
choosing the hardware/software boundary for IPC facilities, and of determining the appropriate primitives to
provide (or support) in hardware. Through the separation of policy and mechanism in IPC, it may also be
possible  to  have  different  IPC  facilities  simultaneously  coexist  in  a  system.  Furthermore,  by  separating  policy 
and  mechanism  an  IPC  facility  can  be  defined  that  meets  the  requirements  for  DOS  research,  in  a  manner 
superior to that which can be accomplished through alternative approaches.

4.2.3.1 Significance of Interprocess Communication Facility Design and Implementation


IPC facilities stand out as special in comparison to other operating system facilities. An IPC facility is
typically included in an operating system kernel^5, and recently many of the other (non-kernel) facilities are
being made available through a system's IPC facility. The special role that IPC plays in a system clearly sets it
apart as a facility on which much of an operating system can be constructed, just as most operating systems in
the past were built on memory management. The degree to which a system places demands on an IPC facility
(either due to accessing other facilities through IPC, or due to process communication) influences the
efficiency requirements of the facility.

The  choice  of  IPC  functionality  can  have  a  great  effect  on  the  structure  of  the  system  that  makes  use  of  it. 
For  example,  the  type  of  IPC  facility  provided  can  influence  the  forms  of  control  structures  possible  among 
cooperating processes. This can be seen in the case of a system constructed on an IPC facility that provides only
synchronous Send&Wait and Receive&Wait constructs (similar to remote procedure call semantics). In such
a system, processes are constrained to exhibit coroutine-like behavior, where there is only a single point of
control at any point in time (thus restricting the potential for concurrent execution). However, it might be
possible for a process in such a system to spawn concurrent child processes that could carry out the
synchronous IPC concurrently with the execution of the parent process. Thus, either the control relationships
between processes are unnecessarily limited, or a potentially large number of child processes must be
introduced to simulate the desired behavior, adding not only to overhead but to the overall system complexity.
Another  example  of  how  IPC  functionality  can  have  an  effect  on  systems  can  be  seen  in  the  impact  that  IPC 
exception  conditions  and  their  side-effects  can  have  on  system  design.  An  example  of  such  an  effect  is  the 
structuring  of  system  service  processes  and  operations  to  be  idempotent  in  order  to  cope  with  extraneous 
service  requests  resulting  from  replication  of  messages  in  the  IPC  facility. 
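
As an illustration of the idempotency point, the sketch below shows one common way a service can tolerate replicated messages: it remembers the reply it produced for each request and returns that reply again if the same request reappears. This is a minimal sketch under stated assumptions, not a description of any cited system. The class IdempotentService, the per-request identifier request_id, and the toy "add to balance" operation are all hypothetical.

```python
class IdempotentService:
    """Hypothetical service that applies each distinct request at most once."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request_id -> reply already produced

    def handle(self, request_id, amount):
        # A replicated message carries a request_id seen before:
        # return the recorded reply without repeating the side effect.
        if request_id in self._replies:
            return self._replies[request_id]
        self.balance += amount
        reply = self.balance
        self._replies[request_id] = reply
        return reply
```

The cost of this structure (request identifiers, a reply table that must eventually be pruned) is imposed on the service purely by the exception behavior of the IPC facility beneath it, which is the effect described above.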

The performance of an IPC facility can also have an effect on system structure because the cost of
communication frequently influences the design and partitioning of systems. This is evident in the fact that
virtually all distributed system software is partitioned according to minimum communication bandwidth, as
opposed to some other metric (such as information hiding [Parnas 72]). Entire software structuring
techniques have been developed in response to the relative cost of inter- versus intra-process communication, or
the  cost  of  local  versus  non-local  IPC.  Examples  of  such  structures  include  CLU  Guardians  [Liskov  81], 
Thoth  Pods  [Cheriton  79],  and  StarOS  Task  Forces  [Jones  79]. 


^5 The kernel is the part of the operating system necessary to support basic abstractions such as processes, and mask undesirable portions
of the underlying physical hardware (


4.2.3.2 Alternative Approaches to Flexible Interprocess Communication Facilities

An IPC facility for a system where the software structure is not well defined must support a wide range of
different types of IPC if the facility is not to adversely impact the design of the software that uses the IPC
facility. There exist a number of alternative approaches to achieving such a flexible IPC facility, and the most
significant of these are briefly discussed here.

4.2.3.2.1 A "Parameterized" Approach

One  approach  would  be  to  make  some  educated  guess  as  to  what  the  range  of  IPC  requirements  might  be 
and  specify  an  IPC  facility  that  can  meet  all  of  the  requirements.  This  approach  is  characterized  by  a  heavily 
parameterized  facility  interface,  whose  flexibility  derives  from  the  range  of  functions  achievable  via  this 
interface.  The  problems  with  such  an  approach  include  the  difficulty  of  ensuring  that  all  desired  facilities  can 
be  implemented,  and  the  logical  complexity  of  making  use  of  such  a  parameterized  facility. 

An example of such a parameterized interface can be seen in the MU5 compiler target language model
(CTL) [Barringer 79]. This model provides an abstract machine suitable for use as an intermediate language
interface to a collection of high level languages (e.g., Algol 60, Algol 68, PL/I, Fortran, etc.). The interface
provided by CTL was a highly elaborate, parameterized interface that included features specialized for each
of the languages to be supported. Experience with the CTL interface showed it to be adequate for generating
efficient object code. However, the difficulty of using the highly complex interface and the overall poor
performance of the compilers led to the development of a lower level interface, for which compilers proved
easier to implement and which yielded more efficient generated code.

4.2.3.2.2  A  "Strictly  Layered"  Approach 

An  approach  that  does  not  rely  on  an  a  priori  definition  of  the  requirements  for  an  IPC  facility  involves  the 
choice of a "lowest common denominator" type of low-level facility. The problem of IPC flexibility is dealt
with by providing the simplest possible IPC facility out of which a range of higher-level facilities can be
constructed (through successive layers of virtualization). For example, such a fundamental IPC facility might
include as operations a Non-Blocking Send construct and a Wait for Message construct, the concatenation of
which  implements  a  Blocking  Send  construct  (at  a  higher  level  of  abstraction).  This  reduces  the  problem  of 
having  to  predict  all  possible  higher  level  facilities  in  an  a  priori  fashion  by  only  requiring  that  the  IPC  facility 
designer  ensure  it  is  possible  to  construct  the  desired  facility  from  the  given  lower  level  one.  However,  this 
approach  provides  flexibility  at  the  cost  of  performance:  each  successive  layer  of  virtualization  exacts  a  cost, 
which  accumulates  across  all  the  layers  and  negatively  affects  the  performance  of  the  higher-level  IPC 
facilities. 
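
The layering of a Blocking Send over the two lower-level constructs can be sketched as follows. This is a minimal illustration, assuming that "blocking" means the sender waits until the receiver replies. LowLevelIPC, its per-process mailboxes, and the echo_server helper are hypothetical names, and Python's standard queue and threading modules stand in for the underlying facility.

```python
import queue
import threading


class LowLevelIPC:
    """Hypothetical lowest-level facility: one mailbox per process name."""

    def __init__(self):
        self._mailboxes = {}
        self._lock = threading.Lock()

    def _mailbox(self, name):
        with self._lock:
            return self._mailboxes.setdefault(name, queue.Queue())

    def non_blocking_send(self, dest, message):
        # The mailbox is unbounded, so this returns immediately.
        self._mailbox(dest).put(message)

    def wait_for_message(self, me, timeout=None):
        # Blocks until a message arrives (or raises queue.Empty on timeout).
        return self._mailbox(me).get(timeout=timeout)


def blocking_send(ipc, me, dest, payload, timeout=None):
    """Higher layer: a Non-Blocking Send followed by a Wait for Message."""
    ipc.non_blocking_send(dest, (me, payload))
    return ipc.wait_for_message(me, timeout=timeout)


def echo_server(ipc, name):
    """Toy receiver used only to exercise blocking_send: acknowledges one message."""
    sender, payload = ipc.wait_for_message(name)
    ipc.non_blocking_send(sender, ("ack", payload))
```

Every blocking_send in this sketch pays for two lower-level operations plus the glue between them, which is a small instance of the per-layer cost described above.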

In addition to the performance penalties incurred in a strictly layered approach, it may be quite difficult
(and occasionally impossible) to construct particular functions out of a given set of low-level operations. For
example, implementing a Selective Receive construct in terms of a simple Receive requires multiple message
exchanges and a great deal of logical complexity.
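
A rough sense of the added logic can be had from the sketch below, which treats only the easiest case: the receiver is assumed to be allowed to set unwanted messages aside locally and re-examine them later. Settings in which unwanted messages cannot simply be held by the receiver are not modeled here. SelectiveReceiver and its wanted predicate are hypothetical names, and the simple Receive is passed in as an ordinary function.

```python
from collections import deque


class SelectiveReceiver:
    """Hypothetical Selective Receive built on a simple, in-order Receive."""

    def __init__(self, receive):
        self._receive = receive   # simple Receive: returns the next message
        self._deferred = deque()  # messages received earlier but not yet wanted

    def selective_receive(self, wanted):
        # First look through messages set aside by earlier calls,
        # oldest first, so arrival order is respected among matches.
        for i, message in enumerate(self._deferred):
            if wanted(message):
                del self._deferred[i]
                return message
        # Otherwise keep receiving, setting aside whatever does not match.
        while True:
            message = self._receive()
            if wanted(message):
                return message
            self._deferred.append(message)
```

Even this simplified version needs a second queue and an ordering rule that the underlying Receive knows nothing about.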

4.2.3.2.3 A "Policy/Mechanism Separation" Approach

The policy/mechanism separation approach provides a method of decoupling the requirements-driven
portions of a specific IPC facility from the generic mechanisms required for all types of IPC facilities. This
technique offers a means of implementing a generic set of communications mechanisms, making it possible to
easily implement and modify arbitrary IPC facilities through different organizations of invocations of the
mechanisms (i.e., policies). The policy/mechanism separation approach differs from a strictly layered
approach in that the requirements-driven (hence potentially variable) policy decisions are implemented directly
on top of a collection of mechanisms, as opposed to being constructed out of an arbitrary number of
successively higher-level facilities. Additionally, while the policy/mechanism separation approach creates a pair of
layers (i.e., the policy layer, and the mechanism layer), the policy/mechanism interface does not necessarily
provide a complete facility. It is only through the invocation of the mechanisms in accordance with a
specified policy that a complete IPC facility can be considered to exist. This is as opposed to a strictly layered
approach, which provides a complete (albeit possibly functionally primitive) IPC facility at the lowest level.

While the policy/mechanism separation approach permits the direct implementation of IPC facilities
without intervening layers of functionality, this is not to say that the benefits of hierarchically structured function
composition cannot be used in the construction of IPC facilities via policy and mechanism separation.
Clearly, the more levels of interpretation required to provide a given service, the poorer the performance of
the ultimate service will be. This suggests that the policy/mechanism separation approach would provide
implementations with better performance characteristics than those based on a strictly layered approach.
While this may be the case, it should also be clear that a specialized (and monolithic) implementation of a
facility will almost always perform better than a facility designed to be highly flexible. We explicitly
acknowledge this fact, and willingly accept somewhat sub-optimal performance in return for flexibility.

In  attempting  to  provide  a  flexible  IPC  facility  through  a  policy/mechanism  approach,  it  is  important  to 
define  mechanisms  to  simplify  the  design  and  implementation  of  the  policy  components  of  the  facilities  as 
much  as  possible.  This  can  be  done  at  the  expense  of  additional  complexity  in  the  mechanism  portion, 
because  the  cost  of  the  design  and  implementation  of  the  mechanisms  will  be  non-recurring,  while  the  cost  of 
implementing  different  policies  recurs  each  time  a  different  IPC  facility  is  created.  This  implies  that  the  IPC 
mechanisms’  level  of  functionality  should  be  raised  to  the  greatest  extent  possible,  without  having  the 
mechanisms dictate policy in any way. It should be noted that restricting the range of policies a set of
mechanisms can carry out can be thought of as dictating policy.


It  is  apparent  that  in  such  an  effort  one  is  faced  with  a  problem  analogous  to  that  which  is  currently  at  the 
heart of the instruction set architecture (ISA) debate known loosely as the "RISC/CISC" argument. This
problem revolves around the attempt to optimize a number of attributes (such as execution speed, code size,
implementation complexity of a vehicle to interpret the ISA, etc.) based on varying the level of functionality
of the interface provided by the ISA. The interface provided by a set of IPC mechanisms is subject to an
argument similar to one found in the RISC/CISC debate. If mechanisms with a high level of functionality do not
meet the exact needs of a policy implementer, the cost of achieving the desired result may be greater than that
incurred using only lower level mechanisms. Clearly, this requires that a great deal of effort be made in
determining the optimal level of functionality for a set of mechanisms. The choice of the most appropriate
IPC mechanism interface seems somewhat more manageable than the analogous ISA problem: the choice of
IPC mechanism interfaces is somewhat simplified by the fact that the range of policies to be implemented can
be reasonably well defined.

4.2.3.3 Applying Separation of Policy and Mechanism to Interprocess Communication

To  this  time,  there  have  been  no  attempts  (that  we  are  aware  of)  to  separate  policy  and  mechanism  in  IPC. 
This  is  despite  the  fact  that  IPC  facilities  can  be  expected  to  benefit  from  the  application  of  the  concept  of 
policy/mechanism  separation  in  ways  similar  to  those  achieved  with  other  operating  system  facilities. 

4.2.3.3.1  Providing  Flexible  IPC  Facilities 

Separating  policy  from  mechanism  yields  a  collection  of  functions  (mechanisms)  with  which  various  IPC 
facilities can be implemented by changing policies (i.e., many specialized IPC facilities can be implemented in
terms  of  a  single  set  of  mechanisms).  Policy/mechanism  separation  thus  results  in  highly  flexible  IPC 
facilities.  This  property  is  particularly  useful  for  a  testbed  system,  designed  for  experimentation  with  various 
(possibly  unforeseen)  operating  system  structures. 
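
The idea of several specialized facilities over a single set of mechanisms can be sketched in a few lines. The example is illustrative only and all names in it are hypothetical: PortMechanisms offers three primitives (deposit, pending, withdraw) that imply no delivery order, and two policies, FifoFacility and UrgentFirstFacility, each organize invocations of those primitives into a complete, if tiny, IPC facility.

```python
class PortMechanisms:
    """Hypothetical mechanism layer: primitives over named ports.

    No delivery order is implied here; that decision is left to policy.
    """

    def __init__(self):
        self._ports = {}

    def deposit(self, port, message):
        self._ports.setdefault(port, []).append(message)

    def pending(self, port):
        return tuple(self._ports.get(port, ()))

    def withdraw(self, port, index):
        return self._ports[port].pop(index)


class FifoFacility:
    """Policy: deliver messages in arrival order."""

    def __init__(self, mech):
        self.mech = mech

    def send(self, port, body):
        self.mech.deposit(port, body)

    def receive(self, port):
        if not self.mech.pending(port):
            return None
        return self.mech.withdraw(port, 0)


class UrgentFirstFacility:
    """Policy: deliver the most urgent pending message first."""

    def __init__(self, mech):
        self.mech = mech

    def send(self, port, body, urgency=0):
        self.mech.deposit(port, (urgency, body))

    def receive(self, port):
        pending = self.mech.pending(port)
        if not pending:
            return None
        # Ties go to the earliest arrival, since max() keeps the first maximum.
        index = max(range(len(pending)), key=lambda i: pending[i][0])
        return self.mech.withdraw(port, index)[1]
```

Both facilities are built directly on the primitives rather than on one another, and in this sketch they can share one PortMechanisms instance as long as they use different ports.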

Furthermore,  a  separation  of  policy  and  mechanism  in  an  IPC  facility  permits  the  mapping  of  a  set  of 
slightly abstracted, lower-level mechanisms, into the desired higher-level IPC facility. This mapping is
performed without the overhead incurred by traditional multi-layered implementations, yet does not abandon
the benefits of abstraction. The policy/mechanism separation approach falls between multi-layered
implementations (which impose a substantial cost in terms of interfacing overhead), and monolithic
implementations (which sacrifice flexibility, modularity, and intellectual manageability).

The  separation  of  policy  and  mechanism  also  allows  operating  system  designers  to  choose  IPC  policies  that 
are  specialized  for  particular  environments,  as  opposed  to  having  to  make  the  best  of  whatever  facility  is 
provided by the kernel. A user of the IPC facility can, by specifying different IPC policies, create a range of
custom IPC facilities. This customization of an IPC facility's interface is done once for each new IPC facility
desired, by the individuals who know the most about a system's requirements. Furthermore, the IPC
interface is customized in such a way as not to obscure or restrict the power of the underlying mechanisms, and to
insulate the IPC facility designer from the full complexity of the underlying physical resources. These are all
considered to be desirable characteristics for operating system facilities [Lampson 83].

4.2.3.3.2 Support for Multiple Coexistent IPC Facilities

In an IPC facility designed according to a policy/mechanism separation approach, multiple types (or
versions) of IPC facilities could simultaneously coexist in a system, provided that they do not place conflicting
demands on the underlying mechanisms. The policy/mechanism separation approach permits a decoupling
of  multiple,  coexisting  IPC  facilities  that  is  not  possible  with  other  approaches. 

The  existence  of  multiple  IPC  facilities  in  a  system  would  clearly  require  that  processes  use  a  common  IPC 
facility  for  each  instance  of  IPC.  This  may  partition  the  processes  in  a  system  into  groups  according  to  the 
IPC facilities that they have access to. Multiple coexistent IPC facilities are currently possible by other means;
however, in almost all extant cases, different IPC facilities are implemented in terms of some other IPC facility
(typically in a layered fashion). In a facility implemented according to the policy/mechanism separation
philosophy, the differing IPC facilities can be directly implemented by different policies, thereby eliminating
the potential for problems due to circular requirements (i.e., a facility built in terms of another facility, which
is in turn built in terms of the former facility).

4.2.3.3.3 Providing Hardware Support for IPC

An  important  attribute  of  IPC  facilities  (and  the  one  that  receives  the  greatest  amount  of  attention)  is 
performance. This is largely due to the fact that IPC is a fundamental facility on which more and more
operating systems are relying to an increasingly great extent^6. A common method for enhancing the
performance  of  IPC  facilities  has  been  to  reduce  the  amount  of  interpretation  involved  in  providing  the 
desired  facility.  We  have  pointed  out  that  a  policy/mechanism  separation  approach  provides  a  facility  on  top 
of  a  single  level  of  virtualization.  If  the  physical  resources  of  a  system  directly  implement  the  functionality 
specified  for  a  set  of  IPC  mechanisms,  an  IPC  facility  could  be  implemented  with  the  least  amount  of 
interpretation  (and  hence,  the  greatest  performance)  possible  without  making  the  sacrifices  associated  with 
monolithic implementations. Thus, policy/mechanism separation in IPC offers a promising approach
for applying inexpensive hardware (in the form of VLSI components) to the problem of providing increased
IPC facility performance, without restricting flexibility.


^6 The emphasis on performance may also be attributed to the fact that (as in computer architecture) it is significantly easier to derive
something that can pass as a measure of performance, than to measure the similarly important attributes of modularity, fault tolerance,
life-cycle cost, etc.

4.2.3.3.4 IPC for Decentralized Operating System Research

The property of flexibility, without the typically attendant loss of performance, is a major benefit of IPC
facility design and implementation based on policy/mechanism separation. This attribute is extremely useful
in an environment such as that of performing empirical research on operating systems. In particular,
decentralized operating system (DOS) research [Davies 81] poses a set of problems that make unique
demands on an IPC facility. A major factor in this type of research is the lack of specific knowledge of the
structure of DOS's. This implies the need for an IPC facility that is flexible enough to permit a wide range of
policies to be implemented in the support of DOS experimentation. This need for flexibility, along with the
expectation that DOS's will place significant demands on a system's IPC facility, contributes to creating a pair
of conflicting requirements for an IPC facility, i.e., both flexibility and performance. These requirements
not  only  correspond  to  the  expected  characteristics  of  an  IPC  facility  implemented  using  policy/mechanism 
separation,  but  also  suggest  that  other,  more  common  approaches,  are  less  well  suited  for  the  job. 

4.2.4  Related  Work 

Although there has been a great deal of work in the general area of IPC [Northcutt 83], relatively little of
that work is strongly related to the research outlined in this document. Of the many different types of IPC
facilities that exist or have been proposed, few have had flexibility (in the sense of permitting a range of
different facilities) as a goal, although some contend that their facility is capable of implementing a wide
range of IPC policies [Rao 80].

There  have  been  very  few  explicit  attempts  at  applying  the  policy/mechanism  separation  approach  to  the 
design and implementation of operating system facilities. The majority of the work in this area has been
applied to other operating system facilities, such as paging, protection, and scheduling [Bernstein 71, Levin
75, Ruschitzka 78]. To the best of our knowledge, there are no instances of IPC facilities explicitly designed
and implemented according to the principles of policy/mechanism separation. This is despite the fact that
some IPC facilities consist of operations known as "primitives" [Liskov 79]. However, much work has been
done in the related, general area of abstraction (e.g., layering) for the design of operating systems and their
facilities,  including  IPC  [Reid  80,  Zimmermann  80]. 

4.2.4.1 Design and Implementation of Interprocess Communication Facilities

The  IPC  taxonomy  described  as  an  anticipated  output  of  this  research  will  serve  as  an  illustration  of  the 
wide  range  of  the  IPC  facility  design  space,  while  the  previously  mentioned  IPC  model  will  represent  the 
various  implementations  possible.  From  these  efforts,  the  range  of  possible  IPC  facilities  should  be  apparent, 
in  addition  to  some  of  the  relationships  among  existing  IPC  facilities.  There  are  a  number  of  systems  that 
were designed to permit the implementation of a broad range of IPC policies (e.g., [Fleisch 81] and [Rao 80]).
However, these systems typically provide a complete IPC facility that may have a low level of functionality,
but is general enough to permit the implementation of other IPC policies in terms of the fundamental one.
Most systems do not make an effort to provide flexibility in their IPC facilities (in the same sense that we have
previously discussed), and some of those that do suggest that it be achieved by providing an IPC facility with a
low level of functionality [Allchin 82].

There exist a number of systems that purportedly support their IPC facilities with hardware [Cox 81, Ford
77, Giloi 81, Jensen 78b, Jones 79]. A large proportion of these facilities are supported in firmware and not
strictly in hardware. While microcode allows IPC facilities to be implemented with one less level of
interpretation, the performance benefits are typically not as great as if the support were provided directly by
hardware. It should also be noted that, unlike direct hardware support, microcode support for IPC typically
maps the structure of some software implementation of the facility directly into microcode. (It is also
interesting to note that there exist more instances of applying hardware support to scheduling than to IPC
facilities.)

4.2.4.2  Policy/Mechanism  Separation  in  Operating  System  Design  and  Implementation 

There  exist  very  few  examples  that  apply  the  concept  of  policy/mechanism  separation  in  operating  system 
design  and  implementation.  The  first  major  system  which  explicitly  embodied  these  ideas,  following  their 
inception in the RC4000 multiprocessing nucleus [Brinch Hansen 71], was the Hydra/C.mmp system [Wulf
74]. The Hydra operating system incorporated the principles of policy/mechanism separation, and attempted
to make use of these concepts to the largest extent practical in an actual system implementation. However,
even in Hydra the principle of policy/mechanism separation was applied to only those facilities that most
readily  lent  themselves  to  such  a  treatment,  and  the  intended  separation  of  policy  from  mechanism  in  the 
operating system was (by the designers' own admission) not complete. In subsequent papers on the design
and implementation of Hydra [Levin 75], it is acknowledged that some policy was left in the kernel due to the
performance constraints imposed by certain portions of the implementation. For example, scheduling
policies could not be carried out entirely outside the kernel, as a Hydra protection-changing procedure
call was too expensive to be incurred each time a scheduling decision was to be made. For this reason, the
kernel  contained  parameterized  policy  programs,  causing  the  separation  of  policy  and  mechanism  to  be 
incomplete. In addition, Hydra made no attempt to separate policy from mechanism in other cases where the
cost  in  terms  of  performance  was  deemed  to  be  too  great.  Thus  in  Hydra,  the  only  facilities  whose  policy  and 
mechanism  were  (to  some  degree  or  other)  separated  were  scheduling,  paging,  and  protection.  In  retrospect, 
almost  all  that  was  accomplished  with  regard  to  policy/mechanism  separation  was  the  separation  of  long 
range policy from short range policy. Due to the prohibitive cost of crossing protection boundaries to make
all of the policy decisions separately from the mechanisms, Hydra "mechanisms" typically carried out short term
policy  decisions  and  returned  to  "policy  modules"  where  longer  term  decisions  were  made. 
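
The division of labor described above can be illustrated with a minimal Python sketch. It is not Hydra code: the names `KernelScheduler`, `PolicyModule`, and `run`, and the priority and time-slice parameters, are hypothetical and chosen only to show a fixed short-term rule living in the "kernel" while a separate module makes infrequent, longer-term adjustments through parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    priority: int = 0      # parameter set by the policy module
    time_slice: int = 1    # parameter set by the policy module
    cpu_used: int = 0


@dataclass
class KernelScheduler:
    """Mechanism side: picks the next process using only stored parameters.

    The short-term rule (highest priority first) is fixed here, which is the
    incomplete separation the text describes.
    """
    ready: list = field(default_factory=list)

    def dispatch(self):
        # Short-term decision, made without leaving the "kernel".
        proc = max(self.ready, key=lambda p: p.priority)
        proc.cpu_used += proc.time_slice
        return proc


class PolicyModule:
    """Long-term policy: runs outside the kernel and only adjusts parameters."""

    def rebalance(self, sched):
        # Hypothetical policy: favor processes that have used the least CPU.
        for p in sched.ready:
            p.priority = -p.cpu_used


def run(sched, policy, steps, rebalance_every):
    """Interleave frequent dispatches with infrequent policy decisions."""
    trace = []
    for step in range(steps):
        if step % rebalance_every == 0:
            policy.rebalance(sched)            # infrequent, expensive call
        trace.append(sched.dispatch().name)    # frequent, cheap call
    return trace
```

Raising `rebalance_every` models a higher cost of crossing the protection boundary: the policy module is consulted less often, and more of the effective scheduling behavior is determined by the rule fixed in the mechanism.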


The Intel iAPX 432 implemented what can be thought of as "Hydra on a chip". Because it was in part a
VLSI project, the iAPX 432 did not have the same restrictions on its underlying hardware as did the original
Hydra project. For this reason, the designers of the iAPX 432 were free to make a different set of design and
implementation tradeoffs. However, the iAPX 432 separates policy and mechanism in much the same way
and to the same extent as Hydra (see footnote 7). This is despite the fact that the same objectives of policy/mechanism
separation held for the iAPX 432 designers, and they had the flexibility to provide the hardware support
needed to make further policy/mechanism separations practical.

4.2.5  Approach 

This  section  defines  the  approach  that  will  be  taken  in  this  research  to  achieve  the  objectives  stated  in  earlier 
sections.  The  overall  approach  is  one  of  "outside-in"  development,  as  opposed  to  a  "bottom-up"  or  "top- 
down"  methodology.  This  research  will  be  guided  in  a  top-down  fashion  by  the  principles  of 
policy/mechanism separation along with the results of the IPC facility taxonomy and modeling efforts, while
the literature survey will provide the raw information for a bottom-up type of effort.

The intent of this research is to explore the effects of applying the principles of policy/mechanism separation
to IPC. This will be done by a combination of conceptual and experimental activities. The following is a
roughly  chronological  ordering  of  the  currently  identifiable  events  which  will  contribute  to  this  research. 

4.2.5.1 Survey of the IPC Literature

In order to achieve a solid understanding of the breadth of possible IPC facilities, their implementations,
and their system-level implications, a survey of the literature will be performed. This literature survey will
include descriptions of existing and proposed IPC facilities, discussions of general operating system issues that
relate to IPC facility design and implementation, and papers concerned with interprocessor communication in
general. This survey will not be confined to any particular time frame or subset of the IPC design space. The
result  of  this  survey  will  be  an  annotated  bibliography,  which  will  include  a  critical  analysis  of  each  of  the 
entries.  The  bibliography  described  here  will  serve  as  the  raw  material  that  provides  a  bottom-up  type  of 
impetus  to  this  research. 

4.2.5.2 Taxonomy of the IPC Design Space

Based  on  the  data  points  represented  in  the  bibliography  described  above,  a  taxonomical  structure  of  the 
IPC  design  space  will  be  created.  This  will  provide  a  structure  for  the  many  example  IPC  facilities  in  the 
literature, provide a means of collapsing these many examples into groups which are isomorphic with respect
to their relevant features, and will illustrate the breadth of the IPC design space (in addition to possibly
revealing unexplored regions of the IPC design space). It is from this taxonomy that a manageable number of
specific examples can be chosen to represent the major types of IPC facilities for use in experiments that
attempt to span the breadth of the IPC design space. Furthermore, this process of providing structure to the
IPC design space will prove valuable in the later process of defining specific IPC primitives.

7. This might be attributed to the fact that some of the key members of the iAPX 432 project had previously worked on Hydra.

4.2.5.3  Conceptual  Framework  for  Representing  IPC 

In  order  to  compare  and  to  evaluate  various  dissimilar  IPC  facilities  and  their  implementations,  it  is 
necessary to have some common means of representing IPC in a system context. This calls for the development
of a simple model that can easily represent a broad range of IPC facilities. This model must accommodate
IPC facilities that provide different functions, are implemented in different ways, and exist at different
levels in systems. This tool will aid discussion of the various IPC facilities involved in this research, and
will be useful for structuring thought about IPC facilities - how they are implemented, and how they interact
with other operating system facilities.

4.2.5.4 Evaluation Criteria and Methodology for the IPC Primitives

Prior to the specification of a collection of IPC primitives, a means of determining the success (or failure) of
the effort must be developed. This will be accomplished by first generating a list of evaluation criteria, and
then indicating a methodology for obtaining the necessary information and applying the criteria. The evaluation
criteria will largely be derived from the collection of anticipated characteristics of the use of
policy/mechanism separation in IPC facility design. The means by which the criteria are to be applied must also be
specified,  including  the  experiments  needed  to  derive  a  measure  of  each  of  the  characteristics  of  interest. 

4.2.5.5  Initial  Collection  of  IPC  Primitives 

At  this  point,  it  will  be  possible  to  derive  a  first  collection  of  primitives  that  will  constitute  a  complete  set  of 
IPC  mechanisms.  The  primitives  will  be  synthesized  based  on  experience  from  the  previous  tasks,  and  from 
the  distillation  of  the  many  example  IPC  facilities  into  a  collection  of  generic  IPC  activities.  The  generic  IPC 
activities will be segregated into common groups, and their characteristics will be evaluated to determine if
any of the activities could be subsumed as special cases of more general activities. The collection of representative
IPC facilities can be viewed as manifestations of a range of IPC policy decisions, and the generic IPC
activities are to be transformed into IPC primitives that permit the widest possible range of facilities to be
implemented (by the application of different policies). An effort will be made to ensure that the separation of
policy from mechanism is as pure as possible (i.e., no policy should be prescribed by the generic activities
which are made into primitives). The primitives will, at this point, consist solely of the descriptions of
their interfaces and behaviors. Thus, the implementation of the primitives should not be a factor in their
specification, and implementation artifacts will be strenuously avoided. However, the descriptions of the IPC
primitives must be sufficiently detailed to permit correct implementation of them without further guidance.
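
As a purely illustrative sketch of what a policy-free primitive might look like, the Python fragment below defines a bounded message buffer as the mechanism and two overflow rules as interchangeable policies built on it. The names `Buffer`, `send_drop_new`, and `send_overwrite_oldest` are hypothetical; they are not the primitives this research proposes, which are yet to be specified.

```python
from collections import deque


class Buffer:
    """Mechanism: a message buffer that prescribes no overflow rule."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def is_full(self):
        return len(self.items) >= self.capacity

    def put(self, msg):
        # Deliberately does not check capacity; that decision is policy.
        self.items.append(msg)

    def take_oldest(self):
        return self.items.popleft()


def send_drop_new(buf, msg):
    """Policy A: reject the new message when the buffer is full."""
    if buf.is_full():
        return False
    buf.put(msg)
    return True


def send_overwrite_oldest(buf, msg):
    """Policy B: discard the oldest message to make room for the new one."""
    if buf.is_full():
        buf.take_oldest()
    buf.put(msg)
    return True
```

Both send operations use the same three mechanism calls, so the facility seen by a client changes only by swapping the policy function, which is the kind of breadth the trial implementations are meant to probe.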


4.2.5.6 Trial Implementation of the Initial IPC Primitives

Once  a  full  set  of  IPC  primitives  has  been  specified,  a  series  of  trial  implementations  will  take  place.  The 
purpose  of  these  studies  will  be  to  further  refine  the  proposed  primitives,  and  to  carry  out  a  wide  range  of  trial 
policy  implementations  for  the  purpose  of  determining  the  breadth  of  coverage  of  the  primitives.  These  trial 
implementations  will  consist  primarily  of  "paper  implementations",  in  order  to  maximize  the  number  of 
investigations  possible,  while  minimizing  the  effort  necessary  to  do  so.  These  experiments  will  also  serve  as 
yet  another  filtering  stage  on  the  set  of  example  facilities  (or  policies)  to  be  examined. 

4.2.5.7 Evaluation and Iteration of the Initial IPC Primitives

Based  on  the  trial  implementation  experiments,  the  initial  set  of  primitives  will  be  evaluated  according  to 
the  defined  methodology.  As  a  result  of  the  evaluations,  the  specifications  of  the  primitives  will  be  modified 
as  needed  and  the  implementation  phase  will  be  repeated.  This  iteration  will  continue  until  the  primitives  are 
considered  acceptable,  as  judged  by  the  evaluation  criteria. 

4.2.5.8 Detailed Implementation and Evaluation of the IPC Primitives

The  resulting,  refined  set  of  IPC  primitives  will  be  evaluated  in  greater  depth  (although  in  lesser  breadth). 
These primitives will be implemented on a local network of personal computers (either Suns or Perqs) for
the purpose of detailed experimentation and analysis. For the most part, the implementation of the primitives
will  be  in  a  high-level  language,  and  the  measurement  of  the  primitives  will  be  limited  to  that  which  is 
necessary  to  derive  the  information  required  by  the  evaluation  methodology.  A  small  number  of  different 
policies  will  be  implemented  in  this  series  of  experiments;  the  policies  chosen  will  attempt  to  span  the  widest 
range  of  interesting  IPC  facilities,  with  the  fewest  number  of  policies.  For  comparative  purposes  it  may  be 
desirable  to  implement  a  policy  similar  to  that  of  a  common  IPC  facility  implemented  in  some  other  fashion 
(e.g., directly implemented, multi-layered, etc.).

4.2.5.9 Investigation of Hardware Support for the IPC Primitives

The  specifications  of  the  final  set  of  IPC  facilities  may  be  further  refined  after  the  detailed  implementation 
studies. In any event, the primitives on which data has been collected in these studies will be used to evaluate
the possibility of providing hardware support for them. Various implementation alternatives for the primitives
will be investigated, ranging from predominantly software to entirely hardware. This work will be
carried out primarily as a "paper study", and will make an effort to determine the cost/performance tradeoffs
(across the range of practical implementations) for each of the primitives. The evaluation of providing
hardware support for the primitives will be based on the measured performance of the detailed implementations
of the primitives, the relative impact of the efficiency of each primitive on the performance of the IPC
facility created by a given policy, and the cost and performance of hardware support for the primitives. The
proposed hardware support mechanisms will be specified in a hardware description language.

4.2.6 Contributions

This research will result in a set of IPC mechanisms that support the implementation of a wide range of IPC
facilities. Another contribution will consist of an evaluation of the policy/mechanism approach to IPC, based
on implementations of the previously specified mechanisms and a chosen set of IPC policies. An additional
contribution will be a determination of the range of applicability (and constraints on the use) of a
policy/mechanism separation approach to IPC. Further contributions of this research include a taxonomy of
the IPC design space, a logical framework to represent various implementations of a range of IPC
facilities, an evaluation of the degree to which multiple IPC facilities can be simultaneously supported, and
a determination of whether the set of proposed IPC mechanisms can be effectively supported with hardware (which would
include descriptions of proposed hardware mechanisms).

4.2.6.1 Separation of Policy and Mechanism in Interprocess Communication

The  most  significant  contributions  of  this  work  are  expected  to  result  from  the  separation  of  policy  from 
mechanism in IPC, the creation of a set of IPC primitives, the implementation and evaluation of the primitives,
and an evaluation of the viability of policy/mechanism separation in IPC facility design and implementation.
In this section we discuss each of these topics.


4.2.6.1.1 A Collection of IPC Primitives

Part  of  the  overall  output  of  this  research  will  consist  of  the  specifications  for  a  collection  of  IPC  primitives. 
These primitives will be an example of the mechanisms resulting from the application of policy/mechanism
separation  to  the  implementation  of  an  IPC  facility.  This  exercise  will  be  particularly  valuable  as  there  exist  a 
wide  variety  of  commonly  known  IPC  policies,  but  no  examples  of  mechanisms  for  IPC.  The  primitives  will 
be  derived  based  on  an  understanding  of  the  range  of  possible  IPC  policies,  and  a  determination  of  a  set  of 
mechanisms  necessary  to  implement  a  wide  range  of  policies.  A  major  effort  will  be  made  in  defining  the 
functionality  of  these  primitives  to  maintain  a  separation  of  their  specification  from  their  implementation.  In 
addition to the specification of each primitive, there will be a justification for each of the primitives.

4.2.6.1.2 Implementation and Evaluation of the Primitives and Policies

Additional  contributions  will  derive  from  the  implementation  of  the  IPC  primitives  and  a  selected  set  of 
IPC policies. The resulting implementations will be measured, analyzed, and documented. These implementation
experiments will be carried out on a local network of Sun Microsystems workstations or Three Rivers
Perqs.  The  evaluation  of  the  primitives,  policies,  and  resulting  IPC  facilities  is  to  be  performed  according  to  a 
set  of  criteria  established  prior  to  the  implementation  work.  Much  of  the  success  of  this  portion  of  the  overall 
research  effort  will  be  judged  by  these  evaluations.  It  is  planned  that  the  evaluation  effort  will  determine  the 
degree to which the different components exhibit the behavior expected of them, and how the implementations
compare to implementations using other approaches.


4.2.6.1.3 Estimation of the Suitability of a Policy/Mechanism Separation Approach

Another  expected  output  from  this  work  is  a  determination  of  the  overall  success  of  this  effort  to  separate 
policy  from  mechanism  in  IPC.  This  evaluation  is  intended  to  illustrate  the  conditions  under  which  a 
policy/mechanism approach is appropriate, the relative cost/benefit tradeoffs of the approach, and the circumstances
to which this approach seems best suited. Of particular interest will be the question of how well
the  separation  of  policy  from  mechanism  permits  an  IPC  facility  to  be  constructed  that  meets  the  needs  of  a 
DOS  testbed. 

4.2.6.2 Applying Structure to the Interprocess Communication Design Space

This  portion  of  the  proposed  research  consists  of  two  main  components  -  a  taxonomically  structured 
representation of the IPC facility design space, and a logical framework to illustrate the manner in which IPC
facilities  are  implemented  in  systems. 

4.2.6.2.1 A Taxonomy of Extant IPC Facilities

This work will generate a taxonomy-like tree structure of characteristics that will present a logically organized
representation of the space of existing and proposed IPC facilities (as represented by the open
literature).  This  structure  will  not  be  a  true  taxonomy  in  the  sense  of  providing  both  a  structure  and  an 
interpretation:  the  primary  goal  will  be  a  classification  to  illustrate  differences  and  similarities  in  the  many 
examples  taken  from  the  literature.  The  example  IPC  facilities  will  include  many  types  at  all  layers,  including 
those  in  specific  systems  and  those  proposed  independent  of  systems.  This  taxonomy  will  form  a  decision  tree 
that  will  provide  a  hierarchical  organization  of  the  instances  of  IPC  facilities  at  the  leaves.  Such  a  taxonomy 
will  illustrate  the  range  of  possible  IPC  facilities;  this  will  be  useful  both  in  determining  the  coverage  of  a 
flexible  IPC  facility,  and  in  deriving  generic  IPC  facility  classes. 
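
A decision tree of this kind can be sketched in a few lines of Python. The characteristics used as questions (`blocking_send`, `buffered`) and the leaf labels (`facility-A` and so on) are placeholders invented for illustration; the actual characteristics and groupings are an output of the proposed survey and are not known here.

```python
# Internal nodes ask about one characteristic; leaves list the example
# facilities that are indistinguishable with respect to those characteristics.
TAXONOMY = {
    "question": "blocking_send",
    "yes": {
        "question": "buffered",
        "yes": {"leaf": ["facility-A"]},
        "no": {"leaf": ["facility-B", "facility-C"]},
    },
    "no": {"leaf": ["facility-D"]},
}


def classify(tree, traits):
    """Walk the tree, answering each question from the traits mapping."""
    node = tree
    while "leaf" not in node:
        node = node["yes"] if traits[node["question"]] else node["no"]
    return node["leaf"]


def leaves(tree):
    """Return every group of facilities found at the leaves, left to right."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["yes"]) + leaves(tree["no"])
```

Facilities that land on the same leaf form one generic class, and a leaf with an empty list would mark a region of the design space with no known example.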

4.2.6.2.2 A System Model of IPC

The  output  from  this  effort  will  be  a  logical  framework  for  structuring  thought  about  IPC  in  a  system 
context.  This  framework  is  loosely  referred  to  here  as  a  model.  This  model  is  necessary  as  there  currently 
does  not  exist  a  means  of  representing  the  functionality  of  IPC  facilities,  and  their  implementation,  in  the 
context  of  general  computer  systems.  The  proposed  model  is  to  be  used  to  gain  insight  into  the  nature  of 
specific  IPC  facilities,  to  help  in  evaluating  the  proposed  IPC  primitives,  and  to  aid  in  understanding  the 
implications that the IPC primitives have on the other parts of the system. Unlike more formal models, this
model  will  sacrifice  rigor  in  return  for  a  consistent  structure  that  directly  represents  the  concepts  of  interest. 
This  is  as  opposed  to  formal  models  that  are  sufficiently  expressive  to  represent  the  interesting  aspects  of  IPC 
facilities, but require the information to be heavily encoded (and hence obscured) by the notation.

4.2.6.3 Exploring the Use of Hardware Support for Interprocess Communication

The  primary  contribution  from  this  work  will  be  an  evaluation  of  the  proposed  IPC  primitives  to  determine 
their  suitability  to  being  supported  in  hardware.  Each  primitive  will  be  examined  individually,  and  an 
appropriate degree of hardware support will be determined for each one based on cost/benefit assessments
(according  to  a  set  of  environmental  assumptions).  Where  hardware  support  for  primitives  is  determined  to 
be  of  the  greatest  value,  hardware  support  for  (or  implementations  of)  primitives  will  be  proposed.  The 
suggested hardware mechanisms will be defined, at least to the register transfer level, by a hardware description
language suitable for use in simulation and synthesis efforts.

5. DATE: A Decentralized Algorithm Testing Environment

5.1  Overview 

The  experimental  aspect  of  our  research  occurs  at  both  the  algorithm  and  system/subsystem  levels.  It 
depends  on  two  complementary  components:  an  interim  testbed  (see  Chapter  7),  and  a  discrete  event 
simulator  named  DATE,  which  runs  on  VAX  UNIX. 

The  DATE  system  provides  a  simulation  environment  where  various  types  of  decentralized  algorithms  can 
be  evaluated.  Unlike  a  large  distributed  simulation  system,  DATE  was  implemented  based  on  a  simple  set  of 
primitives (or commands). These primitives support the dynamic creation and destruction of processes, as well as
interprocess communication.

5.2  Design  and  Implementation  of  DATE 

5.2.1  Overview  of  DATE 

• The purpose of DATE is to facilitate experimentation with distributed algorithms in a well
instrumented  distributed  environment. 

•  DATE  provides  a  set  of  mechanisms  to  the  user  (or  experimenter),  which  can  be  invoked  by 
primitives  (or  commands).  These  mechanisms  allow  the  user  to  set  up  the  distributed  system  on 
which  the  algorithms  are  to  be  tested,  and  provide  tools  for  experimenting  with  these  algorithms. 

•  The  algorithms  being  tested  can  be  expressed  in  one  of  two  ways.  The  actual  code  for  the 
algorithms can be written out, or their behavior can be simulated. A combination of the two
techniques  (part  emulation  and  part  simulation)  is  also  possible.  The  underlying  system  on  which 
these  algorithms  execute  is  simulated. 

•  The  motivation  behind  the  concepts  and  facilities  provided  in  DATE  arises  from  the  need  to 
experiment  with  distributed  algorithms,  especially  for  resource  management,  in  the  Archons 
project. The primitives have been selected after a cursory study of two types of experiments
which  will  be  performed  on  DATE.  However,  an  effort  has  been  made  to  allow  the  facility  to 
have  wider  applicability.  Although  there  has  been  no  attempt  to  provide  a  complete  set  of 
facilities  for  a  variety  of  potential  users,  it  is  expected  that  the  facility  can  be  extended  easily  to 
include  other  applications. 


5.2.2  Functional  Specification 


5.2.2. 1  Overview  of  Facilities 

•  DATE  provides  the  ability  to  concurrently  execute  multiple  user  processes  on  multiple  nodes  of  a 
simulated  global  bus  network. 

• It allows the dynamic creation and destruction of user processes, and communication paths between
these processes. This ability is useful in setting up the underlying system on which the
algorithm  is  executed,  as  well  as  in  implementing  the  algorithm. 

• Processes are defined statically. At the time of the creation of a process, its code must be contained
in an executable object file. The code for the process cannot be created during the course
of  an  experiment. 

•  DATE  provides  an  interprocess  message  communication  facility,  which  allows  three  different 
types of messages. The IPC characteristics of a real distributed system are simulated. Messages
encounter  unpredictable  communication  delays.  The  delay  characteristics  can  be  varied  by  the 
experimenter.  The  communication  delay  for  inter-node  and  intra-node  messages  will  be  different 
in  general. 

• It provides the ability to set up, start, and stop an experiment.

•  It  provides  a  recording  of  all  important  events  of  an  experiment  in  an  event  log  file.  Information 
required  about  a  run  of  an  experiment  can  be  recreated  from  this  file.  At  the  conclusion  of  an 
experiment  this  is  the  only  output  provided  by  DATE. 

• It allows the setting of breakpoints in the experiment. These breakpoints can be set at specific
points  in  the  code,  and  also  invoked  asynchronously  from  the  experimenter's  console.  At  any  of 
these  breakpoints,  the  experimenter  can  examine  the  state  of  the  system,  alter  parameters  and 
system  structure,  and  study  the  event  log  file. 

•  The  experimenter  can  write  a  postprocessor  to  extract  the  required  information  from  the  event  log 
file.  The  postprocessor  will  be  specific  to  a  particular  experiment,  or  to  a  class  of  experiments. 
The  number  of  such  routines  which  will  have  a  wider  applicability  is  unknown.  In  due  course  of 
time,  a  library  of  some  general  postprocessing  routines  may  become  available.  The  postprocessing 
routines  can  be  executed  either  at  the  conclusion  of  an  experiment,  or  at  breakpoints  during  an 
experiment. 
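
The combination of simulated message delays, an event log, and a postprocessor can be sketched as follows. This is not DATE's implementation: the function names, the tuple layout of log entries, and the choice of exponentially distributed delays are all assumptions made for illustration. The text states only that delays are unpredictable, adjustable by the experimenter, and generally different for intra-node and inter-node messages.

```python
import heapq
import random


def simulate(messages, node_of, mean_intra, mean_inter, seed=0):
    """Deliver messages in simulated-time order and return an event log.

    messages: list of (send_time, sender, receiver) tuples.
    node_of:  maps a process name to the node it runs on.
    The two mean delays are the experimenter's tunable parameters.
    """
    rng = random.Random(seed)   # fixed seed makes a run repeatable
    pending = []                # priority queue ordered by delivery time
    log = []
    for msg_id, (t, src, dst) in enumerate(messages):
        same_node = node_of[src] == node_of[dst]
        mean = mean_intra if same_node else mean_inter
        delay = rng.expovariate(1.0 / mean)   # assumed delay distribution
        log.append((t, "send", msg_id, src, dst))
        heapq.heappush(pending, (t + delay, msg_id, src, dst))
    while pending:
        t, msg_id, src, dst = heapq.heappop(pending)
        log.append((t, "receive", msg_id, src, dst))
    log.sort(key=lambda event: event[0])   # stable: a send precedes its receive
    return log


def mean_delay(log):
    """Example postprocessor: average transit time recovered from the log."""
    sent = {}
    delays = []
    for t, kind, msg_id, _src, _dst in log:
        if kind == "send":
            sent[msg_id] = t
        else:
            delays.append(t - sent[msg_id])
    return sum(delays) / len(delays)
```

Because the log is the only output, `mean_delay` works from the log alone, in the same way that an experiment-specific postprocessor would be run either at a breakpoint or after the experiment ends.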

5.2.2.2 A Scenario for Experimentation

A typical user will follow the steps given below to run an experiment on DATE:

•  The  code  for  each  type  of  process  that  is  to  be  created  during  the  course  of  the  experiment  is 
written out. It is then compiled and linked. At the time of starting DATE, the code for each
process  exists  in  a  separate  object  file. 

• The postprocessing routines for the experiment are written and compiled.

•  DATE  is  started  up.  At  this  time  it  is  in  command  mode,  and  prompts  the  experimenter  for 
instructions.

• The experimenter sets up the initial system configuration and the experiment by creating
processes and communication paths to interconnect those processes. This can be achieved directly
from the terminal, or by creating a process which sets up the system.

• DATE is allowed to run the experiment for a specified length of time. It is now in run mode, and
does  not  respond  to  the  experimenter  in  this  mode. 

•  The  experimenter  can  asynchronously  interrupt  the  experiment  at  any  time,  to  bring  it  back  to 
command mode. This will give control back to the experimenter for any interaction with DATE. Breakpoints
can also be set in the process code. Their effect is the same as that of the interrupt, i.e., the
experimenter can interact with DATE again.

•  After  interrupting  DATE  (in  either  of  the  two  ways  described  above),  the  experimenter  can  use 
any of the primitives available, modify parameters, change the system structure, etc. The
postprocessing routines can also be run on the event log file built up to this point in the experiment.

•  At  the  end  of  the  experiment,  the  DATE  system  can  be  terminated.  The  postprocessor  can  now 
work on the log file left by DATE.

•  If  the  system  gets  wedged  in  the  course  of  an  experiment,  the  entire  system  can  be  killed  by 
sending  an  appropriate  signal. 
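
The alternation between command mode and run mode in the steps above amounts to a two-state machine, sketched here in Python. The class and method names are hypothetical and stand in for whatever interface DATE actually presents at the experimenter's console.

```python
class Session:
    """Tracks whether the simulator currently accepts experimenter commands."""

    def __init__(self):
        self.mode = "command"   # the system starts in command mode

    def run(self):
        """Start or resume the experiment; only allowed from command mode."""
        self._require_command_mode()
        self.mode = "run"

    def interrupt(self):
        """Return control to the experimenter."""
        self.mode = "command"

    # A breakpoint in process code has the same effect as an interrupt.
    breakpoint_hit = interrupt

    def command(self, name):
        """Execute a primitive; rejected while the experiment is running."""
        self._require_command_mode()
        return f"executed {name}"

    def _require_command_mode(self):
        if self.mode != "command":
            raise RuntimeError("not in command mode")
```

Modeling the interrupt and the breakpoint as the same transition reflects the statement that their effect is identical: either one returns the system to command mode.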

5.2.2.3  Detailed  Specification 

•  DATE  is  a  message  based  system.  It  provides  the  ability  to  dynamically  create  and  destroy 
processes  and  communication  paths.  These  processes  and  communication  paths  can  be  used  to 
simulate the underlying distributed system being experimented with, and for implementing distributed
algorithms.

•  The  definition  of  a  process  is  static.  The  code  for  all  types  of  processes  which  will  be  created 
during  an  experiment  has  to  be  provided  in  advance.  The  definition  of  a  process  is  known  as  a 
template.  A  template  gives  the  type  of  a  process.  When  a  new  process  is  created,  its  type  or 
template  has  to  be  specified. 

•  Each  process  is  associated  with  a  node.  This  is  to  enable  the  communication  system  to  model  the 
delay characteristics of messages more accurately. On average, interprocess messages on the
same  node  will  have  shorter  transit  times  than  those  across  nodes.  The  nodes  are  assumed  to  be 
connected  by  buses. 

•  The  IPC  mechanism  provides  three  types  of  message  communication. 

o  Direct:  Single  sender,  single  receiver. 

o  Broadcast:  Single  sender,  multiple  receivers. 

o Selector: Single sender, any one of a set of possible receivers.

In  effect,  the  IPC  mechanism  provides  various  communication  paths  for  each  process.  The  sender 
need not necessarily know the type of message being sent.


• Broadcast and selector types of messages are provided by associating processes with broadcast sets
and selector sets. If a particular message has to be broadcast, it is sent to the appropriate broadcast
set, and received by all members of that set. Similarly, a selector message is sent to a particular
selector set, and received by any one of its members chosen randomly. Membership of a set
defines a communication path for a process, on which other processes can send messages to it.
Each process can belong to multiple broadcast and selector sets simultaneously.

•  Direct  messages  can  be  sent  to  any  process  whose  ID  is  known. 

•  The  IDs  of  processes  and  sets  provide  all  the  communication  paths  in  the  system. 

•  Each  process  is  associated  with  a  single  mailbox  on  which  it  receives  messages.  Mailboxes  have 
fixed  sizes  which  can  be  set  at  the  time  of  the  creation  of  the  processes.  It  is  also  possible  to  have 
no  limit  on  the  size  of  a  mailbox. 

• Messages are prioritized. Priorities define the order in which messages are to be received. Within
a priority, the order is first come, first served. A message is preemptible, i.e., it can be discarded to
make room for a higher priority message, in case the receiver's mailbox overflows. The probability
of a message being discarded can be reduced by assigning a higher priority to it. The
current system provides 32 levels of priority, the highest being 1 and the lowest 31.

• The underlying communication system simulated by DATE provides random delays for all messages
being sent on the network bus. At present, these delays do not depend on the current system
conditions.

•  The  message  passing  system  does  not  include  any  facilities  for  protection.  The  system  is  assumed 
to  be  co-operative,  not  competitive.  All  the  processes  are  implemented  by  the  same  user,  and 
need  not  be  protected  against  each  other.  The  facility  will  be  used  for  experimenting  with 
operating  system  level  algorithms.  Protection  for  these  algorithms  is  not  essential. 

• The following primitives are provided by DATE:

o  CreateNode 

This primitive creates a new node, and returns its NodeID.

o DestroyNode (NodeID)

This  primitive  removes  the  specified  node  from  the  system,  and  destroys  all  processes  on 
that  node.  It  can  be  used  for  implementing  a  processor  crash  or  a  node  failure  of  a  real 
system. 

o CreateProcess (TemplateID, NodeID, MailboxSize)

This  primitive  creates  a  new  process,  and  returns  its  ProcessID.  The  type  of  process  to  be 
created  is  specified,  along  with  the  node  on  which  it  is  created,  and  the  size  of  its  mailbox. 
Information  regarding  various  parameters  and  communication  paths  to  be  used  by  the  new 
process  is  sent  to  it  by  messages. 

o DestroyProcess (ProcessID)

This primitive removes the specified process from the entire system. This includes its
removal from its node, and the sets to which it belonged. At this time, some statistics
concerning the process being destroyed are recorded in the log file. These include the
execution time of the process, and the number of primitive calls. This information can be used in
calculating a better estimate of inter-primitive execution time for the process in a subsequent
run of the experiment.

o  CreateSet  (SetType) 

This primitive creates a new set of the specified type (broadcast or selector), and returns its
SetID. This primitive is used for creating a new communication path.

o DestroySet (SetID)

This primitive destroys the specified communication path.

o AddElement (SetID, ProcessID)

This primitive is used for adding a process to an existing broadcast or selector set.

o RemoveElement (SetID, ProcessID)

This  primitive  is  used  for  removing  a  process  from  an  existing  broadcast  or  selector  set. 

o Send (ReceiverID, Priority, Length, MsgContent)

This primitive allows a user process to send messages to one or more user processes. All
three types of message communication, viz. direct, broadcast, and selector, are specified in
the same way. The DATE system understands the type of message communication from the
ReceiverID. The ReceiverID can either be a SetID or a ProcessID. DATE determines the
type of ID and hence the type of message communication specified. The priority of the
message, and its length in number of bytes are also specified. MsgContent is a pointer to a
buffer in which the message is stored. The send primitive is not a blocking send. The caller
does not receive any indication of whether the message reached its destination. He would
have to enquire about the message by an end-to-end protocol. In a real distributed system,
it is reasonable to assume that the IPC facility is unable to inform the sender about the
precise state of his message (if it reached the receiver's mailbox, if the receiver saw the
message, etc.).

o  Receive  (Timeout,  MsgPointer) 

This  primitive  allows  a  user  process  to  receive  messages  from  its  mailbox.  The  process 
blocks  for  at  most  "timeout"  number  of  seconds,  waiting  for  a  message  to  arrive  at  its 
mailbox.  The  value  of  timeout  can  be  set  to  zero,  if  the  user  process  wishes  to  poll  for  a 
message.  The  length  of  the  message  is  returned  as  the  value  of  the  function.  If  timeout 
occurred  and  no  messages  were  received,  an  error  value  is  returned.  If  a  message  is  received, 
its  content  is  placed  in  the  buffer  pointed  to  by  MsgPointer.  Each  user  process  is  associated 
with  a  single  prioritized  mailbox.  The  receive  command  returns  the  first  highest  priority 
message  to  arrive  in  the  mailbox. 

o SetParameter (ParameterName, ParameterValue)

This primitive allows the experimenter to define the value of some parameters in DATE,
such as the execution time of various primitives, and characteristics of the message
communication system. The parameters which can be set in this way are a well defined part of
DATE's interface to user processes.

o Display (ParameterName)

This primitive allows an experimenter to see the current value of a system parameter on his
console. It is useful in conjunction with the SetParameter primitive.


o Breakpoint (Length, MsgContent)

This primitive is used for setting breakpoints in the code for user processes. Once a break is
encountered, the DATE system stops running the simulation, and waits for commands from
the experimenter console. This gives the experimenter the opportunity to modify the system
(create and destroy user processes, communication paths, etc.) and system parameters. He
can also examine the log file, and run postprocessing routines on it. In its effect, a breakpoint
is identical with sending a message to the experimenter's console. The priority of the
message is assumed to be the highest (one), and the contents of the message are typed out at
the console.

o  Interrupt 

This primitive is invoked from the experimenter's console by depressing the <DEL> character
on the ASCII keyboard. This provides an asynchronous breakpoint. The effect of the
interrupt is the same as that of Breakpoint, i.e., of focusing the attention of DATE on the
experimenter console. The message sent is the single word "Interrupt".

o  Record  (Length,  Content) 

This primitive is analogous to the "Write" statement of a programming language. The
contents specified are written out in the event log file.

o  Terminate 

This primitive is used by the experimenter to terminate the entire DATE system. Statistics
related to the execution time and number of primitive calls of all the user processes are
entered in the log file.

o Init (ExecutionTime, NodeID)

This primitive must be executed by every user process on startup. It synchronizes the newly
created process with the rest of the simulation system. ExecutionTime gives an estimate of
the average time taken to execute the instructions between two consecutive primitives in
that process. The Init function returns the ID of this new process, and also the ID of its
node.
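
The mailbox and Send semantics specified above can be illustrated with a minimal Python sketch. It is not the DATE implementation: the class names Mailbox and Ipc, the string set types, and in particular the choice of which message to discard on overflow (the lowest priority, most recently arrived one) are assumptions made for illustration, since the specification only states that a message can be discarded to make room for one of higher priority.

```python
import itertools
import random


class Mailbox:
    """Prioritized mailbox; priority 1 is the highest, larger numbers are lower."""

    def __init__(self, size=None):
        self.size = size                  # None models a mailbox with no size limit
        self._entries = []                # (priority, arrival number, message)
        self._arrivals = itertools.count()

    def deliver(self, priority, message):
        """Store the message; return False if it had to be discarded."""
        entry = (priority, next(self._arrivals), message)
        if self.size is not None and len(self._entries) >= self.size:
            # Overflow. Assumed policy: the victim is the lowest priority,
            # most recently arrived entry already in the mailbox.
            victim = max(self._entries, default=None)
            if victim is None or victim[0] <= priority:
                return False              # newcomer is not of higher priority
            self._entries.remove(victim)
        self._entries.append(entry)
        return True

    def receive(self):
        """Return the first-arrived message of the highest priority, or None."""
        if not self._entries:
            return None
        entry = min(self._entries)
        self._entries.remove(entry)
        return entry[2]


class Ipc:
    """Routes a Send by looking at whether the receiver ID names a set or a process."""

    def __init__(self, rng=None):
        self.mailboxes = {}               # process ID -> Mailbox
        self.sets = {}                    # set ID -> (set type, member process IDs)
        self.rng = rng or random.Random()

    def send(self, receiver_id, priority, message):
        """Non-blocking send; the caller gets no indication of delivery."""
        if receiver_id in self.sets:
            set_type, members = self.sets[receiver_id]
            if set_type == "broadcast":
                targets = list(members)
            else:                         # selector: any one member, chosen randomly
                targets = [self.rng.choice(members)] if members else []
        else:
            targets = [receiver_id]       # direct message to a known process ID
        for pid in targets:
            box = self.mailboxes.get(pid)
            if box is not None:
                box.deliver(priority, message)
```

Keeping the arrival number in each entry gives first come, first served ordering within a priority without any extra bookkeeping.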


5.2.3  A  Sketch  of  the  Implementation 

5.2.3.1 The Structure of DATE

In  this  section,  the  units  comprising  DATE  are  described  briefly. 

•  A  central  controller  process  called  Controller  forms  the  heart  of  the  system.  It  simulates  the 
concurrent  execution  of  user  processes,  and  provides  the  interprocess  communication  mechanism. 
It  also  provides  the  various  primitives  described  in  the  previous  section.  It  is  responsible  for 
running the simulation, recording events, simulating the underlying message communication system,
starting the system, etc. The main data structures it consists of are the following:

o A queue of events called the EventQ is provided. In this queue, events are queued for each
of the user processes (including the Interface process). These events are to take place in the
future, with respect to simulated time. The information given by each entry in the queue is
the name of the event, the value of simulated time at which it is to occur, the ID of the
process waiting for the completion of the event (if any), and any parameters related to the
event (e.g., a pointer to the message in the case of a Send event).


o The controller has tables containing information about the existing user processes and
communication paths. Information about a process includes the ID of the node on which it
executes, the size of its mailbox, and its status (blocked or active). A list of the members of
each set is also maintained.

o Mailboxes are maintained for processes (one per process) from which they can receive
messages. A mailbox is a prioritized queue, with an upper limit on the number of entries
allowed. Each entry in the queue contains the priority of a message, and a pointer to the
message buffer area where the message resides.

o A message buffer area is maintained, in which messages are stored from the time a send
event is queued in the event queue up to the time the message is received (or discarded from
the  system).  Besides  the  message  content,  the  buffer  stores  the  ID  of  the  sender,  the  receiver 
and  the  message. 

•  An  Experimenter's  Interface  process  is  provided  to  enable  the  experimenter  at  the  console  to 
interact  with  the  Controller  process.  The  Interface  process  is  very  similar  to  a  user  process.  The 
controller  provides  the  same  primitives  to  it,  as  described  for  the  user  processes.  However,  in 
some ways it acts somewhat differently, e.g., messages sent to or received from it do not take any
transit time (in terms of simulated time). The Interface process provides a command interpreter.
It accepts the commands from the experimenter, and calls the appropriate library subroutines for
communicating those commands to the Controller. In the current version of DATE, the command
interpreter CI, implemented on the UNIX system at Carnegie-Mellon University, is used.
The Interface process also communicates the information received from the Controller to the
experimenter.

•  The  library  subroutines  which  are  called  by  the  user  processes  (as  well  as  the  interface  process)  are 
part  of  the  DATE  system.  These  subroutines  provide  a  user  process’s  interface  to  the  controller 
process.  They  hide  the  details  of  the  UNIX  operating  system,  and  the  implementation  of  DATE 
from  the  user.  The  two  main  tasks  performed  by  these  subroutines  are: 

o  The  handling  of  the  UNIX  pipes  which  provide  communication  between  the  user  process 
and  the  central  controller. 

o  The  handling  of  execution  time  of  user  processes.  This  value  of  time  is  used  by  the  central 
controller  in  deciding  the  simulated  time  at  which  an  event  queued  by  the  user  process  is  to 
occur. In the present version, the library routines keep track of the number of primitives
executed, and the total CPU time of the user process, to give an estimate of the
inter-primitive execution time of the process.

•  An  event  log  file  is  maintained  by  DATE  which  is  a  dump  of  all  the  events  taking  place. 

5.2.3.2 Implementation of Simulation

A discrete event simulation of the distributed environment is performed. The entire system works in
"lock-step", such that at any point in real time, only one process is executing. Parallelism is implemented by
appropriate handling of simulated time. The central controller allows any one user process to execute at a
time. When the executing process makes a call to the DATE system, the request is entered in an event queue,
and the process is suspended. Now, some other process is allowed to execute, and so on. The event queue
consists of time ordered events. The process chosen for execution is the requestor of the event at the head of
the event queue.


The  basic  simulation  is  described  below.  It  consists  of  taking  events  from  the  head  of  the  event  queue, 
executing  the  primitives,  unblocking  the  processes  waiting  for  those  events  to  complete,  getting  new  events 
from those processes, and queuing them in the event queue. The various steps taken are described below in
some  detail,  especially  with  respect  to  the  handling  of  simulation  time. 

• Remove the event from the head of the event queue. Call it E1.

• Let the simulated time at which E1 is to occur be T1. Check the current value of the simulated
time clock (see footnote 1). If the value is less than T1, then update the clock value to T1.

• Record the current value of simulated time, the name of the event, and the ID of the process
requesting it, in the event log file.

• Call the procedure F1 which handles E1 type of events, and send it all the parameters associated
with E1. Besides executing the required function, the procedure will also find an estimate of the
time taken for that function to execute. Let this value of time be T2 (see footnote 2).

• F1 sends a message to the process P1, which has queued the event E1 and has been waiting since
for its completion.

• F1 also records the parameters of the event E1 in the log file, along with the outcome and results
of the event.

• Wait for a message from process P1, giving the next event E2 to be queued for it. Process P1 will
also send along the value T4 of the time interval between the occurrence of the two events, E1 and
E2. The method of finding T4 is explained later in this section.

• Find the value of simulated time T5 at which the event E2 is to occur, by adding T3 = T2 + T4
to T1 (to get simulated time T5).

• Enter E2 in the event queue in priority order according to the value T5.

• Repeat the above operations until the event clock reaches the value set by the experimenter for the
termination of the simulation.
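
The steps above can be sketched in Python as follows. This is an illustrative reading of the loop, not the DATE Controller itself: the function name run_simulation, the tuple layouts, and the two callbacks (handlers standing in for the procedures such as F1, and next_event standing in for the message exchange with the user process) are hypothetical. The stopping rule, which ends the run when the next queued event lies beyond the end time, is also an assumption.

```python
import heapq
import itertools


def run_simulation(initial_events, handlers, next_event, end_time):
    """Run the event loop and return (final clock value, event log).

    initial_events: iterable of (time, event name, process ID, params)
    handlers:       event name -> function(params) returning (result, T2),
                    where T2 is the estimated execution time of the primitive
    next_event:     function(process ID, result) returning
                    (event name, params, T4), or None if the process is done;
                    T4 is the interval the process reports between events
    """
    seq = itertools.count()               # tie-breaker: equal times stay FIFO
    queue = [(t, next(seq), name, pid, params)
             for (t, name, pid, params) in initial_events]
    heapq.heapify(queue)
    clock = 0.0
    log = []
    while queue:
        t1, _, name, pid, params = heapq.heappop(queue)   # event E1 at time T1
        if t1 > end_time:
            break                          # assumed termination rule
        clock = max(clock, t1)             # the clock never moves backwards
        log.append((clock, name, pid))     # entry in the event log
        result, t2 = handlers[name](params)            # procedure F1, cost T2
        request = next_event(pid, result)  # unblock P1 and wait for E2
        if request is not None:
            name2, params2, t4 = request
            t5 = t1 + t2 + t4              # T5 = T1 + T3, with T3 = T2 + T4
            heapq.heappush(queue, (t5, next(seq), name2, pid, params2))
    return clock, log
```

Because only the process at the head of the queue is ever unblocked, a single Python thread of control is enough to model the "lock-step" execution described earlier.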


The  estimate  of  the  time  interval  between  two  consecutive  events  is  found  by  any  user  process  in  one  of  two 
ways.  The  two  options  are  known  as  Default  and  Override. 


Footnote 1: It (the clock value) has to be either less than T1 or equal to T1.

Footnote 2: The time taken to execute each event of type E1 need not necessarily be the same. The procedure can find the time by using a
distribution function, which returns different values for the various calls to the procedure. In the present implementation, however, the
execution time is a constant, settable by the user.

In the default option, the intention is to determine the time interval by finding the real elapsed time for the
CPU to execute the code between the two events (E1 and E2 in the previous example). The default calculation
is done by the library subroutines which interface the user process to the rest of the DATE system. This
calculation is not visible to the code of the user process. The default mode is based on the assumption that the
user process contains the precise code to be experimented with (and not a simulation of its action). In the
current implementation, an estimate of the time interval between two events is used, owing to the lack of a
high resolution clock for measuring elapsed CPU time on the UNIX system. The library routines keep track
of the total execution time of a user process, as well as the number of DATE system calls. These values are
recorded in the event log file when the user process is destroyed. In a subsequent run of the experiment, the
experimenter can use the average value of inter-primitive time calculated from this data, to give a good
estimate of the elapsed CPU time between two events.
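
As a small worked example of this estimate, the following Python sketch computes the average inter-primitive time from the two values recorded in the log. The function name and the sample figures are hypothetical; only the division of total execution time by the number of DATE system calls follows from the description above.

```python
def inter_primitive_time(total_cpu_seconds, primitive_calls):
    """Average CPU time between consecutive DATE primitive calls."""
    if primitive_calls <= 0:
        raise ValueError("at least one primitive call is needed for an estimate")
    return total_cpu_seconds / primitive_calls


# Hypothetical figures: 1.5 s of CPU time over 300 primitive calls.
estimate = inter_primitive_time(1.5, 300)
```

The resulting value is what an experimenter could pass as the ExecutionTime argument of Init in the next run.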

In  the  override  option,  the  time  interval  is  a  simulated  value,  assigned  by  an  explicit  call  to  a  library  routine, 
by  the  user  process.  This  option  is  useful  when  the  user  process  is  simulating  the  behavior  of  an  algorithm  or 
device. 

The Interface process works in the override mode, with the value of the time interval between two events
always being given as zero. The interface process is treated somewhat differently from other user processes by
the DATE system as well. DATE assumes that events take zero execution time when queued by the interface
process. As a result of these two facilities, the interface process "hogs" the controller once it gains access to it,
because all its events get queued for the current value of simulation time, near the head of the event queue
(see footnote 3).

In  effect,  in  Command  mode,  each  command  is  executed  by  using  the  same  simulation  mechanism  (event 
queue,  event  handling  procedures,  etc),  as  used  in  Run  mode,  but  the  simulated  time  does  not  advance,  so 
that  the  commands  are  not  a  part  of  the  simulation.  This  occurs  at  the  start  and  conclusion  of  experiments,  as 
well  as  at  breakpoints.  The  interface  process  relinquishes  control  by  queuing  a  Receive  event  with  a  timeout 
value  of  the  end  of  the  simulation  run. 

5.2.4  Present  Status 

The system has been implemented, and has been working at the level described in this section since May 1983.
The  system  is  currently  being  ported  from  VAX  to  SUN  workstations  by  NOSC. 


Footnote 3: Note that we do not need to worry about other user processes trying to "hog" the system, because we are implementing a co-operating
system. The writer of all the user code and the experimenter at the console will normally be the same person.

6. Decentralized Computer Architecture


6.1  Overview 

One  of  our  tenets  is  that  our  unconventional  decentralized  operating  system  (OS)  ought  to  be  reflected  in 
the architecture of the nodes and internode connection facility of the decentralized computer. (Even when
the nodes and their interconnection are preordained without consideration of the decentralized OS,
knowledge of this OS/hardware interaction enables one to predict the system's suboptimal behavior.)

The  first  step  is  to  migrate  the  global  resource  management  from  the  application  portion  of  each  node  into  a 
dedicated  special  purpose  machine  at  each  node,  designed  expressly  for  executing  decentralized  global 
resource  management  efficiently,  yet  without  taking  cycles  from  the  users.  Several  critical  issues  must  be 
resolved  in  order  to  do  this;  one  is  concerned  with  exploitation  of  OS  and  application  processing,  a  task  we 
have begun, as seen in Section 6.2.

Another  issue  is  our  controversial  position  that  a  complex  instruction  set  processor  is  not  intrinsically 
without merit, as a few computer designers have recently argued; this debate is discussed in Sections 6.3 and
6.4. Other issues remain untouched in this contractual period, but plans are being made for dealing with them
as  soon  as  possible. 

6.2  Separation  of  OS  and  Application  Processing 

6.2.1  Concurrency  Techniques 

In  this  section  we  discuss  in  turn  each  of  five  classes  of  operating  system  functions.  At  the  same  time  we 
will  provide  a  number  of  examples  of  the  ways  in  which  concurrency  of  operating  system  and  application 
processing  can  be  exploited,  and  indicate  some  of  the  architectural  issues  involved. 

6.2.1.1 Processes, Synchronization, and Communication

One of the most fundamental and important abstractions supported by almost all modern operating systems
is that of a process. Closely associated with processes, and often bound up in their definition, are mechanisms
for synchronizing processes and performing interprocess communication. The literature contains abundant
examples of proposed and existing systems which provide some form of hardware support for more efficiently
implementing the process abstraction (see [Wendorf.J 83] for a survey). A number of these systems exploit
concurrency  of  operating  system  and  application  processing  to  some  degree.  In  particular,  it  is  relatively 
common  for  systems  to  perform  the  process  scheduling  task  on  a  separate  processor  from  that  on  which  the 
application  processes  are  executed. 


Process  scheduling  involves  determining,  on  the  basis  of  priority,  waiting  time,  or  whatever,  which  process, 
from the set of processes that are ready to run, will next be executed on the Application Subsystem (AS). The
computation  required  to  determine  which  process  to  run  next  on  the  AS  can,  at  least  in  theory,  be  performed 
on  the  Operating  System  Subsystem  (OSS),  and  overlapped  with  execution  of  the  current  process  on  the  AS. 
To  do  this  in  practice  requires  a  fairly  tight  coupling  between  the  OSS  and  AS.  One  approach  might  be  to 
have  the  OSS  and  AS  share  the  memory  containing  the  process  control  blocks,  which  hold  the  current  status 
and  saved  volatile  state  for  the  processes  that  are  being  executed  on  the  AS.  Only  the  OSS  would  be 
permitted  to  manipulate  the  queues  of  process  control  blocks  used  to  maintain  the  state  of  the  application 
processes.  Note  that  this  queue  manipulation  could  be  done  concurrently  with  AS  application  processing. 
Some mechanism would then be needed, such as the exchange jump of the CDC 6600 [Thornton.J 64], which
would  allow  the  OSS  to  force  an  AS  process  switch. 

Concurrency  can  also  be  exploited  in  supporting  fast  process  switching  on  the  AS,  by  using  a  technique 
which  we  term  register  buffering.  The  main  impediment  to  fast  process  switching  is  the  need  to  save  the 
volatile  state  of  the  current  process,  and  load  the  state  of  the  next  process  to  be  executed.  This  volatile  state 
includes  the  general  purpose  registers,  and  may  also  include  the  virtual  memory  address  map  registers. 

In  the  register  buffering  technique,  we  duplicate  the  AS  processor’s  register  set,  as  shown  in  Figure  6-1.  At 
any  given  time,  the  AS  only  has  access  to  one  of  the  register  sets,  called  the  active  set.  The  remaining  register 
set  can  be  freely  accessed  by  the  OSS.  While  the  AS  is  executing  the  application  process  associated  with  the 
active  register  set,  the  OSS  can  concurrently  be  saving  the  other  register  set  in  the  control  block  of  the 
previous  process  and  then  reloading  those  registers  for  the  next  process  to  be  executed.  When  at  last  it  is  time 
to  switch  processes  on  the  AS,  the  OSS  merely  causes  a  switch  in  the  active  register  set,  being  sure  to 
synchronize  the  switch  so  that  it  occurs  between  instructions  on  the  AS.  Thus,  a  process  switch  will  usually 
occur "instantaneously" and without execution time overhead on the AS.
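
The register buffering technique can be modeled in a few lines of Python. This is a behavioral sketch only, assuming two register sets and representing a process control block as a dictionary with a "regs" entry; the class and method names are hypothetical, and the synchronization of the switch with AS instruction boundaries is taken for granted rather than modeled.

```python
class BufferedRegisters:
    """Two register sets: the AS sees the active one, the OSS the other."""

    def __init__(self, n_regs=4):
        self.sets = [[0] * n_regs, [0] * n_regs]
        self.active = 0                   # index of the set visible to the AS

    @property
    def shadow(self):
        """Index of the register set currently accessible to the OSS."""
        return 1 - self.active

    # OSS side: runs concurrently with the application process on the AS.
    def oss_prepare(self, prev_pcb, next_pcb):
        """Save the shadow set into prev_pcb, then load next_pcb's registers."""
        if prev_pcb is not None:
            prev_pcb["regs"] = list(self.sets[self.shadow])
        self.sets[self.shadow] = list(next_pcb["regs"])

    def oss_switch(self):
        """Flip the active set; assumed to occur between AS instructions."""
        self.active = self.shadow

    # AS side: only ever touches the active set.
    def as_read(self, reg):
        return self.sets[self.active][reg]

    def as_write(self, reg, value):
        self.sets[self.active][reg] = value


# Usage: process A runs and modifies a register, then B is switched in.
pcb_a = {"regs": [1, 2, 3, 4]}
pcb_b = {"regs": [9, 9, 9, 9]}
regs = BufferedRegisters()
regs.oss_prepare(None, pcb_a)
regs.oss_switch()                         # A is now running on the AS
regs.as_write(0, 42)
regs.oss_prepare(None, pcb_b)             # overlapped with A's execution
regs.oss_switch()                         # the switch itself copies nothing
regs.oss_prepare(pcb_a, pcb_a)            # A's state is saved after the switch
```

The only work done at switch time is flipping an index, which is what makes the switch appear free to the AS; all copying happens in oss_prepare.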

It  should  be  clear  that  this  register  buffering  technique  can  be  easily  extended  to  more  than  two  register 
sets,  providing  more  buffering  between  the  OSS  and  AS.  Once  this  is  done,  a  further  extension  to  support 
multiple application processors is quite straightforward. This register buffering technique is significant in
several  ways: 

1.  As  implied  above,  if  the  OSS  is  usually  able  to  have  the  alternate  register  set  loaded  with  the  state 
of  the  next  process  to  be  executed  prior  to  having  to  switch  processes,  little  or  no  application 
processing  time  will  be  lost  to  process  switching  overhead. 

2. If there is usually an alternate, ready-to-run process available, the ability to do a "free" switch to
that process will allow even more operating system processing to be overlapped with application
processing. A process switch can be done immediately upon every call to the operating system.
Processing of the system call can then proceed concurrently with execution of the alternate
application process.


Figure  6-1:  An  Architecture  With  Register  Buffering 

3.  Recently  there  has  been  considerable  attention  focused  on  techniques  for  using  many  registers  in 
a processor's architecture, in order to achieve fast procedure calls [Dannenbe.R 79, Ditzel
82, Patterson 81, Radin.G 82, Sites 79]. A recurring problem, however, has been that the increased
number of registers, while making procedure calls fast, makes process switching very slow.
Register  buffering  appears  to  offer  an  effective  solution  to  the  problem  of  how  to  provide  both 
fast  procedure  calls  and  fast  process  switching. 

The  OSS/AS  Interface  required  for  the  register  buffering  technique  involves  the  very  tight  coupling  of  the 
two subsystems. The OSS must be able to access and modify the processor registers of the AS. Currently
available processors do not permit this type of external access to their "internal" registers. As a result, this
technique can only be used in those systems incorporating a custom designed AS, rather than a commercially
available  application  processor.  Even  the  use  of  bit  slice  processors  to  implement  the  AS  requires  careful 
consideration,  since  many  such  devices  will  not  permit  externally  accessible  registers  to  be  used. 

On  the  other  hand,  it  should  be  quite  easy  to  implement  the  technique  for  concurrent  scheduling  of 
processes,  even  in  systems  which  use  a  commercially  available  processor  in  the  AS.  In  this  case  it  is  only 
necessary  for  the  OSS  to  have  access  to  the  portion  of  the  AS  primary  memory  which  contains  the  process 
control  blocks  and  scheduling  queues,  and  have  some  means  of  forcing  the  AS  to  perform  a  process  switch. 
Some  form  of  interrupt  could  be  used  for  this  latter  purpose. 
Shared access by the OSS to the AS primary memory is also the chief requirement for the OSS to be able to
support application interprocess communication. The request to send or receive a message could be conveyed
from the AS to the OSS via the AS system bus, which is monitored by the OSS. If fast process switching is
available, perhaps through the use of register buffering, the OSS can initiate an AS process switch
immediately upon receipt of the send or receive request. The OSS can then carry out the request, which
primarily involves moving the message from one part of the AS memory to another, concurrently with
execution of the new process on the AS. In the absence of fast AS process switching it may be more efficient
to simply suspend AS processing until the OSS has handled the send or receive operation. In such a situation
it would be beneficial to have special hardware or microcode in the OSS to make these operations very fast.

The  OSS  and  AS  concurrency  can  also  be  exploited  when  creating  and  destroying  processes.  Destroying  a 
process  can  be  done  very  quickly  since  it  is  only  necessary  to  mark  the  process  as  destroyed.  The  actual  data 
structure  manipulations  and  other  processing  required  to  purge  the  process  from  the  system  can  then  be 
carried  out  by  the  OSS  while  the  AS  continues  execution  of  the  process  which  invoked  the  destroy  function. 
Since  the  creation  of  a  process  often  involves  the  copying  of  some  state  information  from  the  parent  to  the 
child, it would be best to switch processes on the AS, assuming fast process switching is available, so that the
AS  can  continue  with  execution  of  another  process  while  the  create  function  is  being  handled.  This  is  similar 
to  the  technique  used  for  the  interprocess  communication  functions.  Note  that  the  creation  and  destruction 
of  interprocess  communication  paths  can  be  handled  analogously  to  creation  and  destruction  of  processes. 
Furthermore,  all  of  these  techniques  require  only  that  the  OSS  have  access  to  the  primary  memory  of  the  AS. 
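
The concurrent destroy operation described above can be illustrated with a minimal sketch. It assumes a toy process table held in memory shared by the AS and the OSS; the class name, the method names, and the table layout are hypothetical and are not taken from any Archons implementation. The AS-side call only flags the process, and the OSS reclaims its resources later.

```python
from collections import deque


class ProcessTable:
    """Toy process table shared by the AS and the OSS (hypothetical layout)."""

    def __init__(self):
        self.state = {}             # pid -> "ready" or "destroyed"
        self.resources = {}         # pid -> list of resources held
        self.purge_queue = deque()  # destroyed pids awaiting OSS cleanup

    def create(self, pid, resources):
        self.state[pid] = "ready"
        self.resources[pid] = list(resources)

    def destroy(self, pid):
        """AS-side call: constant time, only marks the process as destroyed."""
        self.state[pid] = "destroyed"
        self.purge_queue.append(pid)

    def oss_purge_step(self):
        """OSS-side work: purge one destroyed process, returning what it held."""
        if not self.purge_queue:
            return None
        pid = self.purge_queue.popleft()
        freed = self.resources.pop(pid)
        del self.state[pid]
        return freed
```

In this sketch the invoking process sees only the cost of `destroy`; the data structure manipulation in `oss_purge_step` would run on the OSS while the AS continues.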

6.2.1.2 Virtual Memory and Protection

The  provision  of  virtual  memory  and  protection  in  a  computer  system  requires  the  use  of  special  hardware 
to  perform  address  translations  and  protection  checks  at  memory  access  time.  However,  the  management  of 
pages  in  memory,  and  the  handling  of  page  faults,  is  usually  left  to  operating  system  software.  This  is  another 
area  where  concurrency  of  operating  system  and  application  processing  can  be  exploited  to  provide  improved 
and expanded support for operating system functions. The BCC 500 [Lee.W 74] and SYMBOL [Richards.H
75]  are  two  examples  of  systems  which  employed  specialized  processors  to  provide  extra  support  for  memory 
management  and  paging. 

As suggested by Ruggiero and Zaky [Ruggiero.M 80], one of the main ways that the OSS could take
advantage of the available concurrency is by doing paging out ahead of time. In this way, when a page fault
occurs there will be space in memory to accommodate the referenced page without first writing out one of the
memory pages. The net saving is one disk access (the write), plus the time required to run the page
replacement algorithm, when servicing a given page fault. As a result, a faulting process will become ready to
continue execution in approximately half the time that would otherwise be required.
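
The estimate of roughly half the time can be checked with simple arithmetic. The sketch below assumes that disk accesses dominate the fault service time, that a page write and a page read cost the same, and that the page chosen for replacement is dirty. The function name and the figures of 30 ms per disk access and 1 ms for the replacement algorithm are illustrative assumptions, not measurements.

```python
def fault_service_time(disk_access_ms, replacement_ms, paged_out_ahead):
    """Time until a faulting process can resume, under the stated assumptions."""
    if paged_out_ahead:
        # A free frame already exists: only the read of the referenced page remains.
        return disk_access_ms
    # Otherwise: choose a victim, write it out, then read the referenced page.
    return replacement_ms + 2 * disk_access_ms


conventional = fault_service_time(30, 1, paged_out_ahead=False)  # 61 ms
ahead = fault_service_time(30, 1, paged_out_ahead=True)          # 30 ms
ratio = ahead / conventional                                     # a little under one half
```

If the victim page is clean, no write is needed in either case and the saving shrinks to the replacement algorithm time alone.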

The OSS/AS concurrency also permits the OSS to use more sophisticated page replacement algorithms,
without  performing  the  additional  computation  at  the  expense  of  application  processing.  In  particular,  the 
OSS  could  maintain  more  complete  page  fault  histories  for  the  AS  processes  in  order  to  obtain  better  working 
set  estimations.  It  may  even  be  possible  to  anticipate  a  process'  paging  behavior  and  do  some  amount  of 
paging  in  ahead  of  time. 

The  type  of  OSS/AS  Interface  needed  to  support  the  virtual  memory  management  techniques  outlined 
above  involves  shared  access  to  the  physical  address  space  of  the  AS  by  the  OSS.  The  OSS  will  need  to  read 
pages into that space and write pages from it. The physical address space of the AS can be regarded as a
proper subset of the OSS physical address space. One possible arrangement is shown in Figure 6-1.


Figure  6-2:  An  Architecture  for  Concurrent  Virtual  Memory  Management 


In the architecture of Figure 6-1 the OSS controls the virtual address map, which is used to map every
memory access made by the AS. Only the OSS can manipulate the Address Map Registers, which specify
how the Virtual Address Mapping Unit is to translate the virtual addresses presented to it by the AS into the
corresponding physical memory addresses. The OSS also controls the paging disk, which reads and writes
pages of the AS physical memory using direct memory access. The AS memory is shown as being dual
ported. However, a single port memory on a system bus which permits the paging disk to steal DMA cycles is
also feasible. Both the AS and the OSS must be notified of an application page fault. The AS can then
unwind the instruction which faulted, and the OSS can then initiate an AS process switch and start handling
the page fault.
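
The division of responsibility in this architecture can be sketched in a few lines. The sketch assumes a single-level page map and a 4096-byte page; the class, the method names, and the page size are hypothetical choices made for illustration. Only the `oss_` methods change the map, mirroring the rule that only the OSS can manipulate the Address Map Registers, while `translate` stands in for the mapping unit applied to every AS access.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class PageFault(Exception):
    """Raised when the AS references a virtual page with no mapping."""

    def __init__(self, vpage):
        super().__init__(f"page fault on virtual page {vpage}")
        self.vpage = vpage


class AddressMap:
    def __init__(self):
        self._map = {}  # virtual page number -> physical frame number

    def oss_map(self, vpage, frame):
        """OSS only: install a translation after paging a page in."""
        self._map[vpage] = frame

    def oss_unmap(self, vpage):
        """OSS only: remove a translation before paging a page out."""
        self._map.pop(vpage, None)

    def translate(self, vaddr):
        """Mapping unit: translate an AS virtual address or signal a fault."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self._map:
            raise PageFault(vpage)
        return self._map[vpage] * PAGE_SIZE + offset
```

In a real system the fault would be signalled to both the AS and the OSS in hardware; the exception here only marks the point at which that notification occurs.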


Note that in the architecture of Figure 6-1 the OSS has its own private memory which contains the
operating system code, system tables, etc. Also note that it should be possible to use commercially available
processors for both the OSS and the AS in such an architecture. It may even be possible to use an existing
memory management unit for the virtual address mapping function.

The functions of dynamic memory allocation and deallocation can also benefit from the exploitation of
operating system and application concurrency. It is possible for the OSS to do all of the free list
manipulations, including concatenation of adjacent free areas, concurrently with the continued execution of the
application process. As a result, a more complex free list structure can be used, such as different free lists for the
various sizes of blocks, which will permit allocation to be done very quickly. The time required for
deallocation will remain very small, almost instantaneous from the application process' point of view, since all that is
required is to flag the block as deallocated and then do the actual processing later. These concurrent memory
allocation and deallocation techniques require only that the OSS have access to the primary memory of the
AS. If the OSS also controls the AS memory mapping unit, it can ensure that a free page is always
preallocated, ready for use by the allocation function. This also helps ensure that allocation can be done very
quickly.
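
A minimal sketch of this allocation scheme follows, assuming one free list per block size and integer block addresses invented for the example. The class and method names are hypothetical, and concatenation of adjacent free areas is omitted for brevity. Deallocation only records the block; the OSS returns it to the proper free list later.

```python
class SegregatedAllocator:
    """Toy allocator with one free list per block size (illustrative only)."""

    def __init__(self, block_sizes, blocks_per_size):
        self.free_lists = {}  # block size -> list of free block addresses
        self.size_of = {}     # block address -> block size
        self.pending = []     # blocks flagged as deallocated, not yet reclaimed
        next_addr = 0
        for size in sorted(block_sizes):
            self.free_lists[size] = []
            for _ in range(blocks_per_size):
                self.free_lists[size].append(next_addr)
                self.size_of[next_addr] = size
                next_addr += size

    def allocate(self, nbytes):
        """Fast path: take a block from the smallest adequate free list."""
        for size in sorted(self.free_lists):
            if size >= nbytes and self.free_lists[size]:
                return self.free_lists[size].pop()
        return None

    def deallocate(self, addr):
        """Application-visible cost: just flag the block as deallocated."""
        self.pending.append(addr)

    def oss_reclaim(self):
        """OSS-side work: return flagged blocks to their free lists."""
        reclaimed = 0
        while self.pending:
            addr = self.pending.pop()
            self.free_lists[self.size_of[addr]].append(addr)
            reclaimed += 1
        return reclaimed
```

A flagged block is not reusable until `oss_reclaim` has run, which is the price paid for making `deallocate` nearly instantaneous.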

6.2.1.3 Device Interface

It  is  now  common  for  an  operating  system  to  virtualize  the  devices  it  supports  by  defining  a  highly  abstract 
interface  for  each  of  them.  In  some  systems,  such  as  UNIX  [Ritchie.D  74],  all  devices  are  made  to  look  like 
files. In others, such as the IBM System/38 [Hoffman.R 78], all devices are made to look like processes.

By  having  the  OSS  provide  the  device  interface,  the  AS  is  freed  from  performing  the  processing  required 
for  low  level  device  handling,  such  as  fielding  interrupts  and  providing  buffering.  This  can  be  a  substantial 
saving  when  one  considers  frequently  interrupting  devices  such  as  real  time  clocks.  However,  it  also  points 
out  the  need  for  a  uniform  message  addressing  mechanism  for  both  application  and  operating  system 
processes (assuming we are making devices look like processes). An application process on the AS should be
able to communicate with a device handler "process" on the OSS in the same manner as it would
communicate with any other application process.

The OSS/AS Interface requirements, if the OSS is to provide the device interface, are essentially just those
needed  for  the  OSS  to  support  the  AS  interprocess  communication  mechanism,  as  discussed  earlier.  The  only 
difference  here  is  that  sends  and  receives  may  involve  copying  between  the  shared  AS  memory  and  the 
private  OSS  memory.  However,  note  that  this  use  of  message  passing  to  invoke  operating  system  functions 
requires  that  the  message  communication  facility  be  implemented  very  efficiently.  Otherwise  a  great  deal  of 
AS  processing  time  will  be  “wasted”  in  simply  calling  the  operating  system. 


6.2.1.4 File System

The file system holds a prominent place in most operating systems. As with device interfaces, the file server
can be provided as a process to which application processes direct their requests via messages. In this way it is
quite straightforward to implement the file server on the OSS, similar to the way device handlers are
provided.  The  same  message  communication  interface  as  used  for  device  handler  invocation  can  be  used  for 
file  system  requests. 

Performing the file system functions on the OSS, concurrently with application processing on the AS,
removes a substantial execution overhead from the AS. Furthermore, enhanced capabilities can be added to
the file system at no cost to AS processing. Improved file read ahead and write behind can clearly be done
concurrently with application processing. Incremental backup of the file system and/or replication of files for
reliability is also possible. Disk garbage collection and the ability to run disk diagnostics, without slowing
application processing, are two other very significant ways in which the available concurrency can be
effectively exploited.

It  should  be  noted  that  the  ability  to  run  diagnostics  and  tests  concurrently  with  application  processing  is 
not  unique  to  the  file  system.  This  technique  can  also  be  used  in  the  other  classes  of  operating  system 
functions.  For  example,  the  virtual  memory  manager  could  run  a  memory  diagnostic  on  each  page  of  the  AS 
memory which it pages out.

6.2.1.5 User Interface

When we speak of the user interface we are referring to the interface provided to the human user when
interacting  with  the  system.  This  could  be  provided  through  a  process  which  interacts  with  the  terminal  at 
which  the  user  is  located,  interprets  the  input  provided  by  the  user,  and  creates  the  appropriate  processes  to 
carry  out  the  tasks  requested  by  the  user. 

As  with  the  device  interfaces  and  file  system,  there  are  certain  advantages  to  executing  the  user  interface 
processes on the OSS. Primarily, it permits the interface to be made more sophisticated without reducing the
amount  of  processing  power  available  for  executing  applications.  This  is  very  important  now  that  more  and 
more  stress  is  being  placed  on  the  quality  of  the  user  interface  provided  by  systems.  Many  more  specialized 
devices  for  speech  and  graphical  interaction  can  be  economically  accommodated  in  this  way. 


6.2.2  Generic  Concurrency  Techniques 

From the work which we have done thus far on developing operating system and application concurrency
techniques, as outlined earlier in Section 6.2.1 (Concurrency Techniques), we have noted that our techniques
fall  into  three  main  classes.  These  classes  represent  generic  concurrency  techniques  which  can  be  applied  in 
the  implementation  of  many  different  operating  system  functions.  The  three  generic  concurrency  techniques 
noted  to  date  are: 

1. Precomputation

The idea here is to anticipate the next occurrence of some function and have most, if not all, of
the required computation done ahead of time. The register buffering technique is an example.
An analogy from another area of computer architecture is the instruction preparation unit used to
speed up the interpretation of instructions in many processors.

2. Postcomputation

Sometimes it is possible to "pretend" to have completed a requested function by simply
flagging it as accomplished, and then actually doing the required work afterwards. Concurrent
process destruction is an example. This technique is analogous to lazy evaluation.

3. Shifted Tradeoff

Some functions come in logical pairs, such as dynamic allocation and deallocation of memory.
Furthermore, there is often a tradeoff between their execution speeds. Depending on the data
structure used, one or the other will be fast and its alternate slow. If one of the functions can be
handled quickly, due to precomputation or postcomputation, then it pays to shift the tradeoff so
that the alternate function is more efficient. For example, memory deallocation can be handled
quickly using postcomputation, so we design the free list data structure to permit fast allocation,
even though that shifts more computation to the deallocation function.

At  present,  if  we  cannot  find  a  way  of  effectively  using  one  or  more  of  the  above  techniques  when 
considering  the  implementation  of  some  operating  system  function,  we  must  rely  on  having  fast  process 
switching  available.  If  an  alternate,  runnable  application  process  is  available,  then  that  process  can  be 
executed  while  the  operating  system  function  is  being  handled  for  the  current  process.  Note  that  this  can  be 
regarded  as  another  generic  concurrency  technique,  but  in  this  case  it  does  not  improve  the  execution  time  of 
the  individual  application  process  on  whose  behalf  the  function  is  being  performed.  However,  it  does 
improve  the  overall  system  throughput,  since  other  application  processing  continues  while  the  operating 
system  function  is  being  executed. 


6.3  Instruction  Set  Architecture  Design 


6.3.1  Introduction 

The increasing size and complexity of processor instruction sets has encompassed additional data types (e.g.,
floating point, decimal, character strings, arrays, priority queues, linked lists), operating system support (e.g.,
process management, synchronization, interprocess communication), and compensations for other disabilities
(e.g., addressing modes to deal with insufficient instruction address field length). This trend is typified by
such popular complex instruction set computers (CISC) as the VAX and the Motorola 68000, and currently
culminates in the Intel 432. The intent has been to improve performance and, especially in the case of the
432, to reduce software costs.

Recently an alternative design approach has been widely publicized as the "Reduced Instruction Set
Computer (RISC)". Three research machines in particular, the IBM 801 [Radin 83], the Stanford MIPS [Hennessy
82], and the Berkeley RISC I [Patterson 82a], are based on the belief that computers with simpler instruction
sets  are  not  only  less  expensive  to  design  and  build,  but  also  offer  greater  performance  than  the  more 
traditional  complex  computers. 

We  feel  that  both  approaches  have  merit,  but  that  neither  is  sufficiently  scientific,  and  we  do  not  find  much 
credible  evidence  for  the  claims  of  either  RISC  or  CISC  proponents.  Ad  hoc  designs  and  implementations 
have been done but not evaluated, the effects of orthogonal issues have not been separated out, systems which
differ  in  kind  have  been  "compared",  important  attributes  which  are  difficult  to  quantify  have  been  presumed 
not  relevant,  and  complexities  have  been  "removed"  by  moving  them  to  different  places  in  the  system. 
Unfortunately, many important ramifications of this controversy have remained unappreciated due to
unquestioning acceptance of either point of view.

In  this  paper  we  briefly  discuss  some  of  the  topics  of  contention,  note  where  additional  research  is  needed  to 
attain  better  understanding,  and  generally  argue  for  a  view  of  the  matter  which  is  broader  than  just  instruction 
set  size  and  complexity. 

We will use the term RISC to imply all research efforts concerning reduced instruction set computers.
RISC  I  will  be  specifically  used  to  refer  to  the  research  being  pursued  at  Berkeley. 

6.3.2  Notions  of  Simplicity 

Perhaps  the  most  fundamental  RISC  tenet  is  that  the  most  primitive  instructions  dominate  a  computer’s 
activity,  and  that  their  performance  will  be  adversely  affected  by  the  inclusion  of  anything  more  complex  in 
the instruction set. The tenet carries with it an implicit assumption, which is not necessarily true, that any loss
in performance is inherently bad. We will discuss this further in Section 6.3.3. This section explores some of
the implications and problems that follow from this tenet.

6.3.2.1 Perceiving Distinct Qualities

Unfortunately, the terms reduced and complex have been contraposed in the context of this first tenet of the
RISC philosophy, as Clark has pointed out [Clark 80]. In fact, two orthogonal instruction set dimensions are
at  issue  here:  size  (reduced  vs.  massive)  and  complexity  (simple  vs.  complex).  The  first  dimension  concerns 
the  number  of  instructions  (addressing  modes,  number  of  possible  values  in  instruction  fields  in  general)  that 
characterize  an  architecture.  The  other  concerns  the  functional  complexity  of  the  instructions  as  might  be 
represented by the number of "primitive" operations that would be needed to synthesize them. This
dimension is much harder to quantify since mixtures of simple and complex instructions can exist within the same
architecture.

It is true that reduced and simple take on a mutually reinforcing relationship in the context of RISC design,
as massive and complex normally do in the CISC domain. This does not have to be the case. Simplicity
means different things to chip designers, computer architects, and all other people involved in the design
process. The VAX has often been singled out as being a complex architecture. Yet, from the designer's point
of view, the VAX was to be a simple yet massive instruction set. The definition of simplicity used in this
context was:

those attributes (other than price) that make minicomputer systems attractive. These include
approachability, understandability, and ease of use [Strecker.W 78].

It is questionable whether or not this goal was achieved, but it will be argued later that in some ways, the
issue of simplicity may not be of prime importance.

6.3.2.2 The Utility of Complex Instructions

RISC proponents warn of detrimental effects due to the use of complex instructions. Nevertheless, the
popularity of installing support for specialized functions such as interprocess communication (IPC) seems to
be undiminished. The designers of the ELXSI 6400 report [Olson 83], for instance,

A key architectural feature which allows the operating system to cope with the tremendous
variability in its hardware environment is the microcode and hardware implemented message
system. The use of messages allowed us to make choices in the CPU and operating system
architecture which greatly enhance the effectiveness of additional processors.

But we concur with one of the RISC criticisms of the published accounts of these machines: it is not enough
to show that a complex instruction executes faster than an equivalent sequence of primitive instructions. It
must also be shown that the net effect is to improve system performance. We believe that this aspect of the
problem must be part of the design effort.
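
The distinction between local speedup and net effect can be made concrete with a small calculation. All figures below are hypothetical and chosen only for illustration: a workload of 1000 primitive instructions and 10 invocations of a complex function, a 100 ns primitive instruction that stretches to 110 ns when the complex instruction is added, and a complex instruction taking 400 ns in place of a ten-instruction software sequence.

```python
def total_time_ns(n_primitive, primitive_ns, n_complex, complex_ns):
    """Total execution time for a workload with two kinds of operations."""
    return n_primitive * primitive_ns + n_complex * complex_ns


# Complex function synthesized from ten 100 ns primitive instructions.
software_version = total_time_ns(1000, 100, 10, 10 * 100)

# Complex instruction in hardware (400 ns), with primitives slowed to 110 ns.
hardware_version = total_time_ns(1000, 110, 10, 400)

# The function itself runs 2.5 times faster (1000 ns versus 400 ns) ...
local_speedup = (10 * 100) / 400
# ... yet the workload as a whole takes longer.
net_slowdown = hardware_version > software_version
```

Under these assumed numbers the software version takes 110000 ns and the hardware version 114000 ns. Different instruction mixes or a smaller penalty on the primitives would reverse the outcome, which is why the net effect has to be evaluated rather than presumed.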

Even  the  premise  that  primitive  instructions  always  dominate  a  computer's  activities  is  not  universally  true. 
The instruction set interface of machines designed to run operating systems may be solely at the "system call"
level,  for  example. 

There are many other computing environments, such as real-time or signal processing systems, where it
would be hard to argue against supporting complex functions directly in the computer architecture and
implementation. More generally, Radin has written [Radin 83]:

It  is  often  true  that  implementing  a  complex  function  in  random  logic  will  result  in  its  execution 
being significantly faster than if the function were programmed as a sequence of primitive
instructions. Examples are floating point arithmetic and fixed point multiply. We have no objections to
this strategy, provided the frequency of use justifies the cost and, more important, provided these
complex instructions in no way slow down the primitive instructions [Radin 83].

We subscribe to this statement, but we assert that "frequency of use" is an insufficient criterion for justifying a
given instruction. As Clark and Levy have pointed out [Clark 82]:

Aggregate  statistics  alone  cannot  guide  the  design  of  an  instruction  set  intended  for  different 
languages  and  applications.  In  particular,  instructions  that  are  infrequently  used  overall  can  be 
critical  for  some  intended  users. 

The  notion  that  complex  functions  slow  down  the  simple  actions  of  a  computer  seems  to  be  the  real 
problem  that  prevents  us  from  having  the  best  of  all  worlds.  We  believe  that  serious  research  efforts  in  the 
areas  of  functional  partitioning,  instruction  interpretation,  and  distributed  decoding  will  produce  computer 
structures  which  reduce  or  eliminate  this  effect.  Until  research  is  directly  aimed  at  this  problem,  a  greater 
understanding  of  the  scientific  truths  and  principles  involved,  as  opposed  to  the  folklore  currently  being 
disseminated,  is  not  possible. 

6.3.2.3  Designing  Simple  Machines 

The  RISC  simplicity  tenet  has  a  related  side-effect  in  that  simpler  computers  are  thought  to  be  easier  and 
faster  to  design  than  complex  ones.  Unfortunately,  the  comparisons  that  have  been  published  to  substantiate 
this  tenet  are  based  on  design  times  for  a  student  project  simple  microprocessor  versus  the  design  times  for 
some current complex commercial microprocessor products [Patterson 82a]. While these comparisons seem
interesting,  we  do  not  find  them  relevant,  since  the  objectives,  constraints,  and  design  tasks  are  significantly 
different  between  the  academic  and  industrial  environments.  Design  considerations  such  as  yield,  testability, 
and  fault  tolerance  are  not  handled  in  the  same  way  for  both  contexts.  Logistical  and  administrative  factors 
necessarily  imposed  by  a  large  organization  (e.g.,  synchronizing  simultaneous  development  of  support  chips, 
software  development  systems,  and  fabrication  facilities)  cannot  be  disregarded.  It  strikes  us  as  improper  to 
make  any  such  comparisons  without  first  attempting  to  calibrate  the  units  of  measurement. 

To  make  matters  more  confusing,  comparing  the  hardware  design  times  of  processors  of  different  scale  is 
misleading  since  complexity  shed  by  the  processor  design  team  could  well  be  encountered  by  the  system 
software  designers  or  even  the  applications  programmers.  The  tables  of  comparisons  don’t  even  hint  at  the 
tradeoffs.  As  Justin  Rattner  has  said  [Barney  82],  "They  say  that  the  RISC  (I)  chip  was  developed  in  6%  of  the 
time  it  took  for  the  432  ...  My  response  is,  'Yes,  and  they  only  did  6%  of  the  job.’"  The  hardware/software 
partitioning of a design begs for a more detailed analysis. In particular, an economic analysis of the
hardware/software design cycle tradeoffs would be of strong practical interest.

6.3.2.4 Complexity Migration

One  specific  form  of  transferring  complexity  away  from  the  hardware  is  by  migrating  functions  to  compile 
time  which  previously  were  considered  run  time  activities  that  were  supported  by  hardware.  One  of  the  three 
criteria  for  instruction  set  design  in  the  801  was  that  the  operation  could  not  be  moved  to  compile  time.  The 
801  approach  also  utilized  a  very  sophisticated  compiler  to  make  some  of  these  tradeoffs  (for  example,  by 
precomputing  functions  wherever  possible). 

This  concept  of  complexity  migration  is  the  basis  for  MIPS  [Hennessy  82],  which  is  based  on  a  pipeline 
implementation  having  no  hardware  interlocks.  The  only  means  of  ensuring  proper  sequencing  of  events  in 
this  machine’s  instruction  stream  is  via  a  pipeline  reorganizer  program.  Although  a  straightforward  compiler 
can  be  used  to  generate  valid  code  for  this  machine,  it  is  only  by  using  the  reorganizer  that  the  machine's 
pipeline  can  be  fully  utilized. 
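
To show what moving interlocks into software involves, the sketch below handles a single hypothetical hazard: an instruction that uses the result of a load in the very next slot. It is a deliberately simplified stand-in, not the actual MIPS reorganizer; the instruction format (operation, destination, sources), the one-slot load delay, and the function name are all assumptions. It only pads with no-ops, whereas a real reorganizer would try to fill the slot with useful work.

```python
NOP = ("nop", None, ())


def insert_delay_slots(program):
    """Pad with a no-op wherever an instruction uses the preceding load's result."""
    out = []
    for instr in program:
        if out:
            prev_op, prev_dest, _ = out[-1]
            _, _, sources = instr
            if prev_op == "load" and prev_dest in sources:
                out.append(NOP)  # software takes the place of a hardware interlock
        out.append(instr)
    return out


program = [
    ("load", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),  # needs r1 immediately after the load
    ("add", "r5", ("r3", "r4")),
]
padded = insert_delay_slots(program)  # a no-op now sits between the load and its use
```

Padding alone yields valid code but leaves the slot idle; reordering independent instructions into it is what allows the pipeline to be fully utilized.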

Of  course,  not  all  complex  functions  can  be  moved  to  compile  time.  Dynamic  program  activities,  such  as 
garbage  collection  and  bounds  checking,  must  of  necessity  be  done  at  run  time.  But  as  further  work  is  done  to 
evaluate  the  merits  of  complexity  migration  to  compile  time,  computer  system  designers  will  be  able  to  make 
decisions  based  on  evidence  rather  than  educated  guesswork. 

6.3.3  Importance  of  the  Performance  Aspects  of  Computer  Design 

Throughout  the  RISC  literature  there  is  a  largely  unstated  but  pervasive  bias  towards  those  aspects  of  a 
computer  system  dealing  with  performance.  Clearly,  if  all  other  attributes  are  equal,  higher  performance 
must  be  considered  an  improvement  to  a  machine.  However,  we  believe  that  it  is  possible  to  ascribe  too  much 
importance  to  the  performance  dimension  of  a  computer  system.  Since  performance  is  the  most  quantifiable 
measure  of  a  machine,  it  is  the  most  frequently  discussed  and  measured  -  not  because  performance  is  always 
inherently  so  much  more  valuable  than  other  system  parameters,  but  because  benchmarking  is  the  easiest  way 
of comparing system alternatives. It is a mistake to pursue performance blindly without explicitly
acknowledging what is being traded for it.

6.3.3.1 What is meant by performance?

System  performance  can  be  measured  in  different  ways,  and  these  differences  can  be  significant,  because 
they  reflect  the  fundamental  goals  of  the  system.  For  example,  performance  can  be  measured  in  terms  of 
peak  instruction  execution  rate,  a  number  which  may  be  of  interest  in  deciding  the  suitability  of  various 
supercomputers to some proposed task. Another performance measure is response time, which is of
particular interest in real-time control systems. Yet a third performance measure is average system throughput.
We  assume  that  it  is  this  measure  which  is  being  discussed  in  the  RISC/CISC  literature. 

6.3.3.2 Performance vs. Other System Aspects

It  is  our  view  that  a  very  wide  range  of  performance  is  currently  available  in  the  marketplace,  and  that, 
except  for  the  most  demanding  applications,  a  user  with  sufficient  money  can  buy  whatever  performance  is 
desired.  We  agree  that  the  dramatic  declines  in  the  price/performance  ratio  have  been  largely  responsible  for 
the  enormous  economic  growth  in  the  field. 

But  it  strikes  us  that  significant  performance  gains  will  be  given  to  us  almost  free  by  the  semiconductor 
technologists,  first  by  the  constantly  improving  fabrication  process  (driving  down  gate  delays),  and  second  by 
the  increasing  integration  densities  that  will  allow  more  computational  activity  to  occur  without  the  need  to  go 
off-chip. Device technologists cannot address other aspects of a computer system, however. For example,
even if the military had arbitrarily large amounts of money, they could not buy systems with the level of
modularity and expandability they desire, because we architects do not yet know how to provide it at any cost.
Conversely,  the  Japanese  Fifth  Generation  Project  requires  such  large  increases  in  performance  that  it  is 
commonly  assumed  that  no  von  Neumann  architecture  will  ever  be  able  to  provide  it  (regardless  of  the 
complexity  or  simplicity  of  the  instruction  set).  Hence,  research  is  being  pursued  on  multiple  processor 
architectures,  where  the  bulk  of  the  performance  results  from  combining  large  numbers  of  processors. 
Research  to  deal  with  these  problems  should  be  directed  at  the  interconnect  and  usage  problems  as  much  as 
the  processors  themselves. 

6.3.4  Ambiguous  Performance  Claims 

Much  of  the  interest  generated  by  the  RISC  efforts  comes  from  the  reported  performance  improvements. 
One  study  [Patterson  82b],  for  example,  lists  the  execution  times  of  the  simulated  Berkeley  RISC  I  chip  vs.  the 
68000, the Z8002, the VAX 11/780, the PDP-11/70, and the BBN C/70. For every benchmark measured
(benchmarks included Ackermann's function, quicksort, and the puzzle program) the simulated RISC I chip
promised  faster  execution  times. 

We  do  not  find  that  the  performance  claims  that  have  been  published  are  conclusive  evidence  that  a 
breakthrough  in  the  price/performance  ratio  has  been  achieved.  We  will  state  some  of  our  reservations  in  the 
next  few  sections. 


6.3.4.1 All or Nothing

RISC machines have not only a reduced instruction set, but also many other items which affect
performance. For example, as we have pointed out earlier [Colwell 83], we believe that the overlapping register
window  scheme  used  in  RISC  I  accounts  for  a  substantial  amount  of  the  performance  expected  from  that 
machine,  and  can  be  of  value  to  CISC  machines  as  well.  Likewise,  we  would  like  to  know  exactly  what 
performance  improvement  to  expect  from  the  compiler  and  pipeline  management  techniques  used  in  MIPS. 
We  feel  that  it  would  be  much  more  meaningful  to  compare  reduced  instruction  set  machines  to  CISCs  on 
those  aspects  which  are  unique  to  each,  factoring  out  those  which  can  be  utilized  in  either  style. 

6.3.4.2 Fair Comparisons

The Intel 432 would seem to be a very promising candidate for close scrutiny in this RISC/CISC
controversy. It can be considered an archetypal CISC: it has a complex instruction set (including such
instructions as BROADCAST TO PROCESSORS and LOCK OBJECT); it is programmable only in Ada, not
assembler; and it is a complete computer system, including an operating system kernel. Although the 432
performance study [Hansen 82a] did not mention RISC I, the same benchmarks were used and it is a simple
matter to correlate the two reports to arrive at the conclusion that the 432 runs the benchmarks about two
orders of magnitude slower, in general.

But  as  we  pointed  out  in  [Colwell  83],  it  is  important  not  to  overlook  other  aspects  of  the  432  that  have 
affected these results. For example, the 432 is an object-oriented machine. This object orientation was
provided to support the intended software programming environment, and is an attempt to minimize life
cycle cost, not performance per se. However, the other machines measured in [Hansen 82a] were not object
oriented, so we are left unsure as to what part of the reported performance loss in the 432 is due to its object
orientation. The object facilities cannot be removed from the 432 for comparison purposes, but they can be
added in software to other machines to make the results useful.

All  benchmarks  reported  in  [Patterson  82b]  were  coded  in  the  C  programming  language.  We  do  not  object 
to  C  as  the  high-level  language  (HLL)  of  choice.  Given  that  machine-dependent  aspects  of  the  C  language 
are  avoided,  this  eliminates  one  source  of  uncertainty.  However,  not  all  comparisons  are  set  up  in  this  way. 
Our  concern  is  that  the  432  has  only  an  Ada  compiler.  Thus  the  quality  of  the  Ada  compiler,  as  well  as  the 
efficiency  of  the  Ada  language  itself,  are  in  question  here.  We  feel  that  these  two  variables  alone  render  the 
432  results  inconclusive. 

Another  interesting  aspect  of  the  432  that  may  have  a  negative  effect  on  its  single-processor  performance  is 
its  innate  multiprocessing  support.  This  feature  was  designed  at  the  system  level  so  that  processors  can  be 
added and automatically utilized without software participation. As far as we know, this is one of the few
system architectures that have ever been produced with this capacity. Intel has estimated that support for
multiprocessing  takes  approximately  13%  of  the  available  microcode  space  on  chip,  which  we  take  as  more 
evidence  that  this  is  not  a  trivial  function  to  implement.  Since  the  machines  to  which  the  432  is  being 
compared  do  not  have  this  kind  of  support,  we  can  only  wonder  to  what  extent  this  feature  skews  the 
conclusions  that  one  might  attempt  to  draw. 

The  432  also  provides  some  hooks  for  enhancing  system  reliability,  such  as  fault  handling  and  functional 
redundancy  microcode.  In  a  naive  comparison  of  instruction  execution  rates  this  hidden  functionality  will 
appear  as  unseen  baggage,  dragging  down  the  machine's  performance  on  the  benchmark.  Benchmarking  is  a 
very  interesting  exercise,  but  unless  the  machines  being  compared  differ  only  along  one  major  dimension,  it  is 
difficult  to  make  a  fair  comparison.  An  unfair  comparison  is  inconclusive. 

6.3.4.3 Justification and Analysis

The  important  question  is  whether  these  extra  features  in  CISCs  such  as  the  432  contribute  enough  to  have 
made  the  apparent  performance  loss  and  extra  design  complexity  worthwhile.  One  study  [Cox  83]  reports 
that the 432's support for its interprocess communication primitives does indeed speed up those operations by
large  amounts  over  the  software  approach  used  in  comparable  machines.  This  proves  that  a  SEND  can  be 
executed  faster  if  we  are  willing  to  devote  system  resources  to  it,  but  it  leaves  unanswered  several  other 
questions  of  equal  importance.  How  were  the  complex  functions  like  SEND  chosen  in  the  first  place?  What 
was  gained  on  a  system-wide  basis  by  including  these  functions?  What  was  the  cost,  both  in  resources  and  in 
low-level  instruction  performance?  We  suggest  that  one  of  the  tasks  facing  computer  architecture  research  is 
to  find  out  how  to  assign  better  life-cycle  cost  models  to  the  systems  we  build,  so  that  the  performance  aspects 
don't receive improper weighting, either positive or negative. Especially in a system architecture as radically
different  as  the  432,  it  is  incumbent  upon  the  designers  to  carefully  justify  the  design  tradeoffs  they  have 
made.  It  seems  to  us  that  they  have  done  so  at  the  system  level,  making  the  case  that  object  orientation  is  a 
goal worth pursuing. However, they do not attempt this same justification at the architecture or implementation
levels, nor do they analyze the resulting machine. A good understanding of existing CISCs is not
possible  without  examining  the  tradeoffs  at  those  levels. 

6.3.5  Architectures  and  Implementations 

Since  the  early  sixties,  computer  designers  have  found  it  beneficial  to  conceptually  separate  a  machine’s 
architecture  from  its  implementation.  This  dichotomy  was  useful  when  trying  to  decompose  the  design 
problem.  However,  its  economic  strength  came  then,  as  it  does  today,  from  software  compatibility. 

After  20  years  of  using  this  concept,  a  sense  of  good  and  bad  computer  architecture  has  developed  around 
the notions of purity. To quote from Blaauw and Brooks [Blaauw 82]:

The architecture must be comprehensible and consistent, so it will be easy to learn and use. The
user  beholds  the  whole  system.  It  will  be  easy  for  him  to  master  and  use  it  only  to  the  degree  that 
it  shows  an  integrity  of  concept,  a  consistency  of  viewpoint,  tying  together  all  the  design  decisions. 

6.3.5.1 The Rules Have Changed

The  recent  research  in  RISC  concepts  has  stretched  the  fabric  of  some  purist  computer  architecture  notions. 
To  begin  with,  software  development  costs  are  still  of  major  concern  to  most  installations,  but  not  at  the 
assembly  level.  While  there  are  probably  more  assembly  programmers  hacking  today  than  we  care  to  know, 
the world is being dominated by high-level language code. The importance of high-level language programming
is reflected by the fact that almost all new general purpose computing machines, RISCs and CISCs
alike, are founded on optimizing the execution of compiled code. (This is certainly no new idea in light of the
long  history  of  Burroughs  high-level  language  machines.) 

6.3.5.2 Departures From Purity

Purist notions notwithstanding, it seems indisputable that blurring aspects of architecture and implementation
can often lead to better machines. Again, from the notes of Blaauw and Brooks:

... some of the genius of Seymour Cray's work ... lies precisely in his total personal control of
architecture,  implementation,  and  realization,  and  his  consequent  freedom  in  making  trades  across 
the  boundaries. 

While  the  separation  of  these  factors  produces  conceptually  cleaner  architectures,  and  might  aid  in  the 
partitioning of the design task, RISC research trades these advantages for possible performance improvements.
For example, a cache is normally invisible to the software, yet the 801 has explicit instructions for
cache control so that the computer does not perform unnecessary cache line loads and stores. Instruction
ordering constraints, as imposed by a machine's implementation, are present in the 801, the RISC I, and
especially MIPS with its non-interlocked pipeline. These, too, are attempts to optimize a machine's performance
by trading across classical boundaries.

Traditionally,  microcoding  has  been  a  powerful  implementation  technique  for  instruction  interpretation 
which  has  made  designing  massive/complex  machines  like  the  432  and  the  VAX  tractable.  In  an  interesting 
twist  of  concepts,  many  RISC  researchers  view  their  machine  architectures  as  exposing  what  might  otherwise 
be  a  hidden  vertically  coded  microengine.  While  RISC  instruction  bits  drive  control  lines  every  cycle  via 
minimal decoding, which is reminiscent of traditional microcoded instructions, such a view ignores the conceptual
and cultural differences between macrocode and microcode. Manufacturers have generally not supported
machines with such exposed microengines (implementations) since they require individual compilers
(and hence, cannot share object code with other machines) and are severely limited in the types of changes
that can be made to them after their release. We do not believe that these problems will remain for long, as
will soon be explained.

6.3.5.3 Moving to Higher Ground

Since programmers never see the machine-level interface, and since performance gains are possible by
mixing architecture and implementation, do "computer families", in the traditional sense, have a place in the
futures of computer companies? The designers of computers as diverse as the VAX, the 801, the 432, and
RISC I all wanted their machines to be good targets for compiled code. The natural question to ask
becomes: "Why don't computer companies market 'families' of machines that are compatible at the system
software level?" The HLL programs would be compatible on "families" of machines that could span a
spectrum of price/performance ranges. Each member of the family would be free to trade architecture and
implementation  features  to  optimize  performance.  True,  unique  compilers  would  be  needed  for  each  family 
member,  since  instruction  sets  would  differ,  but  this  issue  may  decline  in  importance  with  the  entrance  of 
automated  compiler-compilers  [Wulf  80]. 

A possible problem arises with the definition of the common system software interface. Just compare a
UNIX  manual  with  almost  any  machine  definition  you  can  find.  It  seems  a  hard  enough  task  just  to  define 
such  a  massive  interface  without  having  to  ensure  compatibility  of  that  interface  across  many  machines  and 
their  system  software.  Validation  becomes  a  much  larger  issue  than  simply  running  a  suite  of  test  programs. 
This  challenge  is  not  necessarily  insurmountable,  but  it  is  not  well  understood  at  present.  The  idea  is  not  a 
new  one,  just  one  that  has  yet  to  succeed  on  a  grand  scale.  Probably  the  largest  hurdle  will  be  in  overcoming 
the  electro-political  status  quo  that  has  dictated  how  computer  systems  should  be  structured  for  the  last  20 
years.

6.3.6  Conclusion 

The  RISC  advocates  have  put  forth  a  perspective  on  processor  architecture  and  implementation  which  is 
more  coherent  and  concrete  than  those  which  seem  to  guide  most  CISC  proponents.  We  feel  that  this 
perspective  is  both  interesting  and  insightful  in  some  ways,  yet  oversimplified  and  thus  justly  controversial. 

6.4  Multiple  Register  Sets 

6.4.1  Introduction 

The  Archons  project  is  an  attempt  to  define  and  implement  decentralized  resource  management 
mechanisms  at  the  operating  system  level  and  below  in  a  computer.  We  believe  that  such  a  system  could 
benefit  by  having  special  processors  execute  operating  system  functions.  Not  only  could  these  processors 
increase  system  performance  by  processing  concurrently  with  their  associated  applications  processors,  but 
they  could  also  be  tailored  to  support  the  decentralized  control  functions.  We  are  interested  in  defining  such 
a  machine,  which  we  call  Meta. 


The characteristics of the Archons operating system are currently being defined. Even without their
definition, it is clear that the semantic level of these functions could be quite high. Machine operations to
provide direct support for communication, atomic transactions, or resource allocation could be envisioned.
This immediately places this machine in the midst of the current heated RISC/CISC (Reduced Instruction
Set Computer vs. Complex Instruction Set Computer) controversy [Patterson 80a, Clark 80, Ditzel 80]. In
our view, much of the research in this area has been interesting but inconclusive for reasons that we will
explain. To help clear our view of this conflict, we are performing two experimental studies that will produce
some  direct  results  on  particular  issues. 

For  consideration  in  the  Meta  machine,  we  would  like  to  find  an  existence  proof  for  the  performance  value 
of  complex  instructions  in  some  environment.  As  a  means  to  this  end,  the  first  study  investigates  several 
aspects of the Intel 432: the extent to which simple instruction performance may be degraded due to complexity;
the extent of performance degradation due to object orientation; and the extent to which performance
is increased via the machine's hardware/firmware support for complex functions. The experimental
method  we  propose  is  to  migrate  the  instruction  set  of  the  432,  including  its  object  orientation,  to  more 
conventional  processors.  The  separation  of  object-oriented  overhead  from  instruction  set  complexity  issues 
should  make  performance  evaluation  studies  of  complex  processors  more  relevant  and  conclusive. 

The second study takes a complementary tack on our RISC/CISC concerns. While reduced instruction set
advocates  have  stated  reasons  why  processors  with  limited  functionality  might  offer  improved  performance, 
experimental  evidence  to  support  these  views  is  needed  to  help  validate  such  claims.  A  few  studies  have 
attempted  to  do  this  by  evaluating  particular  reduced  instruction  set  architectures.  The  RISC  I 
architecture  [Patterson  82a,  Patterson  82c,  Foderaro  82]  is  the  subject  of  one  such  study.  Indeed,  the  reported 
performance  of  this  machine  is  high  enough  to  draw  attention.  Included  in  this  machine  is  a  mechanism  for 
providing  each  procedure  with  its  own  register  set  while  saving  the  state  of  previous  procedures  in  other 
register  sets.  This  mechanism,  called  multiple  register  sets  here,  is  used  to  save  the  RISC  I  many  memory 
accesses  that  it  would  otherwise  have  to  perform  as  part  of  its  procedure  linkage. 

Unfortunately,  it  is  hard  to  evaluate  reduced  instruction  set  concepts  based  on  the  results  from  the  RISC 
I.  This  is  because  no  attempt  is  made  in  these  simulations  to  decouple  the  performance  effects  of  the  reduced 
instruction  set  from  those  of  the  multiple  register  sets.  Indeed,  we  believe  that  instruction  sets  and  multiple 
register sets have orthogonal effects on performance. If this is so, then multiple register sets could be used to
equal advantage in both reduced and complex instruction set architectures. Since we are interested in the
experimental support for reduced instruction set concepts, and since we also are curious, as computer engineers,
about the effects of multiple register sets on computer architectures, we have started a study to
evaluate such effects. It should be noted that the word "architecture" is defined here, as in [Amdahl 64], to
mean the description of "the attributes of a system as seen by the programmer, i.e., the conceptual structure
and functional behavior, as distinct from the organization of the data flow and controls, the logical design,
and the physical implementation."


6.4.2  Evaluating  Complex  Instructions 

The  design  of  a  computer's  instruction  set  can  be  driven  in  many  different  ways.  A  machine  like  the  VAX 
may require an instruction set that maintains some compatibility with previous machines so that the customer
base is not alienated. Machines such as RISC I [Patterson 82a] and MIPS [Hennessy 82] attempt to make the
best use of the implementation technology, VLSI; hence their architectures reflect concerns such as off-chip
delays, amount of design effort, and research goals. The iAPX 432 [Intel 81] design was driven by the desired
software  methodology:  object  orientation. 

There  have  even  been  architectures  proposed  that  defer  the  instruction  set  choice  until  after  the  computer 
has been delivered to the user [Brakefield 82, Jensen 77]. This concept is significantly different from that of
the  common  "writeable  control  store"  (WCS).  The  degree  to  which  a  WCS  machine  can  be  re-configured  is 
severely  limited  by  the  fixed  data  paths,  register  sets,  and  control  word  conventions  characteristic  of  those 
machines.  To  defer  the  instruction  set  choice,  one  must  combine  the  notion  of  "opcode",  which  can  be 
viewed  as  a  "hardware  procedure  call",  with  the  common  software  procedure  call.  Programming  for  such  a 
machine  would  consist  of  sequences  of  function  calls,  each  of  which  would  invoke  some  hardware  or  software 
(or  both)  in  order  to  effect  the  desired  result.  Such  a  function  invocation  mechanism  is  the  basis  for  Meta. 
However,  even  for  a  machine  as  unconventional  as  Meta,  the  issue  of  support  for  complex  functions  arises. 
Are complex instructions necessary and/or beneficial for such support? The benefits and the costs of including
complex instructions are not clearly understood. How, then, do we go about investigating this tradeoff?

We could try looking for precedents. The very large majority of computer instruction sets that have
appeared within the last 7-10 years are what are called CISCs. These are characterized by large numbers of
instructions  (typically  hundreds),  a  rich  set  of  addressing  modes  (say,  six  or  more),  and  the  inclusion  of 
specialized  instructions,  whether  for  high-level  language  support  (the  VAX  CASE  instruction)  or  system  level 
support  (the  432’s  SEND  operator). 

Typically,  computer  architects  strive  for  high  performance,  and  this  has  traditionally  been  the  rationale  for 
installing  instructions  with  high  semantic  content.  Very  recently,  however,  a  number  of  articles  have 
appeared [Patterson 80a, Patterson 82a, Hennessy 82] arguing that the way to increased performance lies not
in more capable instruction sets but in simpler ones. To support this contention, machines such as RISC I
have been designed and some comparisons have been made against complex machines such as the VAX and
the 68000.


When we try to assess the RISC arguments we find that some of the perspectives are valuable and persuasive.
However, we have reservations about the comparisons that have been published. In particular, we
question  the  effects  of  operating  system  overhead,  virtual  memory,  and  compiler  technology  assumed  for  the 
benchmarks  reported.  Another  study  [Hansen  82b]  attempts  to  demonstrate  the  overhead  associated  with  a 
heavily  object-oriented  architecture  such  as  the  iAPX  432.  This  study  found  that  a  4  MHz  432  runs  about  an 
order  of  magnitude  slower  than  other  processors  such  as  the  8  MHz  68000  and  the  VAX  11/780.  But  we  find 
it  hard  to  draw  conclusions  from  this  for  the  following  reasons: 

•  The  object  orientation  of  the  432  indisputably  contributes  heavily  to  the  reported  performance 
degradation. But what percentage of the slowdown is attributable to the transparent multiprocessing
capability that is built in to the 432? In general, one might expect such a machine to exhibit
degraded performance compared to a more conventional uniprocessor, but then the fairer
comparison would be between several 432's running in a system vs. other microprocessors.

•  Is the performance degradation due to the 432's complex instruction set, or to the object
orientation?

•  If the simple 432 instructions run more slowly because of complexity (a general RISC argument),
do the 432's complex instructions buy any of that performance back? Such instructions were not
studied in [Hansen 82b].

The  choice  of  whether  or  not  to  make  a  machine  object-oriented  must  be  based  on  many  factors  other  than 
performance.  We  are  primarily  interested  in  evaluating  the  tradeoffs  inherent  in  complex-instruction-set 
architectures,  and  are  using  the  432  as  a  vehicle  to  this  end.  We  would,  however,  like  to  split  the  overhead 
due  to  object-orientation  away  from  the  overhead  due  to  the  machine’s  complexity.  We  feel  that  this  would 
be  a  much  more  useful  evaluation  of  the  tradeoffs  made  in  the  432.  We  therefore  propose  the  following 
experiments: 

•  Advantages of object-oriented support in hardware: According to a basic tenet of RISC
philosophy, simple 432 instructions ought to exhibit a performance penalty due to the innate
complexity of that machine. We will investigate this by migrating the 432's simple instructions
plus their object-oriented overhead to other machines. This will illustrate the effects of complexity
on simple instructions without allowing the object-oriented overhead to skew the results.

We are also aware of, and have to account for, technology, compiler, and data type differences.

•  The effects of complex instructions in hardware: According to traditional computer architecture
design, migrating software functionality closer to hardware should improve its performance. If
that principle holds in the 432, the complex instructions of that machine ought to exhibit improved
performance vs. software implementation of the same functions on other machines. We will
investigate this in the same way by moving the 432's complex instructions, including the
object-oriented overhead, to other machines.

To  date  we  have  developed  an  ISPS  [Barbacci  80]  description  of  the  432  GDP  and  we  are  currently  adding 
the complex instruction routines to it. Using the Ada description of the 432 microcode algorithms, we will
next  begin  implementing  some  of  the  432  instructions  on  the  68000  and  the  VAX. 


6.4.3  Evaluating  Multiple  Register  Sets 

Many  computer  architectures  use  register  sets  to  provide  a  fast  means  of  accessing  operands  using  short 
addresses.  Usually,  the  contents  of  these  registers  hold  procedure-relevant  values.  At  procedure  boundaries, 
these  registers  are  reloaded  for  a  new  set  of  values.  This  is  done  by  moving  the  registers'  contents  to  and  from 
main  memory.  To  reduce  the  delay  of  such  memory  transfers,  it  is  possible  to  implement  several  logically 
identical register sets and to switch among them at procedure boundaries. (This is also true of memory-to-memory
machines that hold their procedure-relevant values in areas of main memory.) This type of implementation
technique is what we define as multiple register sets (MRSs). It is further possible to reduce
memory  transfer  operations  by  physically  overlapping  the  logically  separate  register  sets  of  a  calling  procedure 
and  its  called  procedure  to  allow  "free"  parameter  passing  between  them.  This  type  of  structure  will  be 
referred  to  as  an  overlapped  register  set  (ORS),  and  is  viewed  here  as  an  extended  type  of  MRS. 
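
The overlap between a caller's and a callee's register sets can be illustrated with a small address-mapping sketch. This is only an illustration under assumed parameters: the function name `physical_index`, the set size of 16, the overlap of 4, and the count of 8 sets are hypothetical choices, not figures taken from RISC I or any other machine discussed here.

```python
def physical_index(window, logical, window_size=16, overlap=4, num_windows=8):
    """Map a logical register of one register set to a physical register number.

    All sizes are illustrative assumptions. Consecutive sets share `overlap`
    physical registers, and the physical file is treated as circular.
    """
    if not 0 <= logical < window_size:
        raise ValueError("logical register out of range")
    stride = window_size - overlap      # fresh physical registers per procedure call
    total = stride * num_windows        # size of the circular physical register file
    return (window * stride + logical) % total


# The caller's highest `overlap` registers are the callee's lowest ones,
# so parameters written by the caller need no memory transfer.
caller_out = [physical_index(0, 12 + i) for i in range(4)]
callee_in = [physical_index(1, i) for i in range(4)]
print(caller_out == callee_in)
```

With these assumed sizes, each call advances the set by 12 physical registers, so the 4 shared registers carry parameters between the two procedures at no cost in memory traffic.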

Ideally,  the  goal  of  this  second  study  would  be  to  answer  the  question: 

What  are  the  effects  and  costs  involved 
in  incorporating  multiple  register  sets 
in  a  computer  architecture? 

This  question  could  further  be  broken  down  into  these  five  issues: 

1.  In  what  ways  is  an  architecture's  performance  changed  by  incorporating  multiple  register  sets? 

2.  What  changes  are  necessary  to  a  machine’s  instruction  set  and  internal  structures  to  support  such 
register  sets? 

3.  How  do  multiple  register  sets  affect  the  task  of  writing  a  compiler  for  an  architecture? 

4.  What  is  the  impact  of  multiple  register  sets  on  a  machine’s  need  for  quick  context  swaps? 

5.  How  does  the  choice  of  high-level  language  or  application  affect  the  usefulness  of  multiple 
register  sets? 

Finding  complete  answers  to  all  of  these  questions  is  beyond  the  scope  of  what  we  wish  to  accomplish  at  this 
time.  In  limiting  the  goals  of  this  research,  we  see  the  first  of  these  five  questions  as  being  most  important  to 
address  and  we  plan  to  give  it  most  of  our  effort.  Although  the  other  areas  are  of  interest,  some  may  not  be 
pursued.  In  the  following  sections,  each  of  these  five  areas  of  interest  will  be  outlined,  with  particular  detail 
given to the first.

It  should  be  noted  that  comparing  RISCs  and  CISCs  is  not  a  primary  objective  of  this  work.  While  we  hope 
to  learn  something  about  the  relative  performance  and  requirements  of  machines  that  differ  in  instruction  set 
complexity, this study concentrates on the relative performance and requirements of the same basic architecture
with and without MRSs. Any light shed on the RISC/CISC debate will occur as a secondary result of
this work. This research also does not relate to many other important aspects of the RISC I machine in
particular.  For  example,  data  path  area  and  man-months  of  design  time  are  not  concerns  in  this  study. 

6.4.3.1 Performance Gains

The  major  reason  for  incorporating  MRSs  in  a  machine  is  to  reduce  the  number  of  accesses  required  of 
main  memory.  To  do  this,  each  called  procedure  is  given  its  own  set  of  registers  and  the  most  recent 
procedures'  states  are  kept  in  the  other  register  sets.  In  this  sense  the  register  sets  cache  the  state  of  many 
procedures  before  the  calls  (or  returns)  overflow  (or  underflow)  the  register  sets'  capacity.  Many  register 
loads  and  stores  can  be  saved  because  a  return  will  often  recall  a  procedure  whose  state  is  in  a  register  set,  and 
a call will often find an empty register set available. This is because the procedure call/return patterns of
most block structured high-level language programs exhibit a certain amount of "locality." What "locality"
means here is that the call depth of a program often varies about some level.
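
The caching behavior described above can be sketched as a count of register-set overflows and underflows over a call/return trace. This is a minimal model under stated assumptions: the function name `count_spills`, the event strings, and the policy of saving or reloading exactly one register set per overflow or underflow are hypothetical simplifications, not the management scheme of any particular machine.

```python
def count_spills(trace, num_sets):
    """Count register-set overflows and underflows for a call/return trace.

    trace    -- iterable of "call" / "return" events, starting inside a main procedure
    num_sets -- number of physical register sets (assumed >= 1)
    """
    depth = 1       # procedures currently active (main is running)
    resident = 1    # active procedures whose state is held in a register set
    overflows = underflows = 0
    for event in trace:
        if event == "call":
            depth += 1
            if resident == num_sets:
                overflows += 1      # no free set: the oldest one is saved to memory
            else:
                resident += 1
        elif event == "return":
            if depth == 1:
                raise ValueError("return from the outermost procedure")
            depth -= 1
            if resident == 1:
                underflows += 1     # caller's state must be reloaded from memory
            else:
                resident -= 1
        else:
            raise ValueError("unknown event: %r" % (event,))
    return overflows, underflows


# Call depth rises to 5 and then varies between 4 and 5.
trace = ["call"] * 4 + ["return", "call"] * 10
for n in (1, 4, 8):
    print(n, count_spills(trace, n))
```

In this model a single register set pays a memory transfer on every call and return, while a handful of sets absorbs the whole oscillation once the call depth has settled about a level.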

Data  caches  and  stack  structures  are  other  approaches  used  to  reduce  the  number  of  memory  accesses 
required  by  a  computer.  While  comparative  studies  among  these  approaches  and  MRSs  would  be  instructive, 
they  are  not  of  primary  interest  to  us.  We  would  ultimately  like  to  evaluate  the  tradeoffs  involved  with 
instruction set complexity. MRSs are a technique, orthogonal to instruction set complexity, that affects the
performance  of  any  general-purpose  register  machine,  as  does  a  data  cache.  This  study  aims  to  characterize 
those  performance  effects  so  that  they  can  be  removed  from  the  RISC/CISC  comparisons  of  register-oriented 
machines  where  they  don't  belong.  Since  we  are  interested  in  register  machine  comparisons,  stack  machines 
are  also  not  relevant  here.  Comparisons  of  stack  and  register  architectures  can  be  found  elsewhere  [Myers  82]. 

A  technique  similar  to  MRSs  is  used  to  reduce  process  swap  time  in  some  machines.  In  these  machines 
there  are  also  many  groups  of  the  architecture's  logical  register  set  but  each  set  is  used  to  contain  the 
procedure  state  of  a  different  process.  This  way,  switching  among  processes  can  consist  of  no  more  than 
changing  register  sets.  This  is  done  on  machines  like  the  Sigma  7  [Sigma  68],  which  can  have  the  state  of  as 
many  as  32  processes  in  registers,  and  the  Dorado  [Lampson  80],  which  can  change  process  state  on  every 
machine  cycle. 

There  are  three  distinct  factors  which  interact  to  provide  performance  gains  for  MRS  machines: 

1.  fewer  memory  accesses  for  storing  and  restoring  procedure  state  are  needed  (by  having  more  than 
one  physical  register  set) 

2.  fewer memory accesses for passing parameters between procedures are needed (by having ORSs)

3.  register sets that are associated with the processor are usually faster than register sets that are
stored in main memory (as is done in the BELLMAC-8 and the TI 9900)


These three performance factors are orthogonal in nature. As such, they can be experimentally measured
separately. A group of single register set machines can be compared to similar machines that are modified to
incorporate MRSs. These comparisons, which would be based on some set of benchmark programs, would
gauge the effects of the first factor. This can be done using simulations of the machines and their MRS-modified
versions. Either compiler modifications would be needed to create the code for the modified
machines, or assembly versions of benchmark programs would be changed by hand to reflect what a compiler
could do. These MRS machines can then be further modified so that each register set has a fixed overlap with
the register sets of the previous and next procedures. These overlapping registers would be used for passing
parameters between procedures, as the RISC I does. Again, simulations can be conducted using the same
benchmarks. The results of this set of simulations would give a generalized view of the second factor's impact.
Since these simulated machines would only be compared against modified versions of themselves, there is no
need to consider differences between machines in register set size or in compiler optimizations or in implementation
techniques. Having a uniform method of managing the MRSs is important. Studies of such
methods have been made [Tamir 83, Halbert 80]. The results of such studies will be used; no attempt will be
made  to  find  independent  conclusions  in  this  regard. 
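
The fixed overlap between adjacent register sets can be illustrated with a small address-mapping sketch. It is not taken from any of the machine descriptions used in this study; the function name and the default sizes (10 global registers, 22 registers per set, an overlap of 6, and 8 sets) are illustrative assumptions. It shows only how a logical register number in the current procedure's set could be mapped onto a circular physical register file so that the last registers of one set are the same physical registers as the first registers of the next set.

```python
def physical_index(window, reg, n_globals=10, n_window_regs=22,
                   overlap=6, n_windows=8):
    """Map a logical register of a given register set to a physical register.

    All sizes are hypothetical defaults chosen for illustration only.
    Registers below n_globals are shared by every procedure; the rest
    belong to the current set, whose last `overlap` registers coincide
    with the first `overlap` registers of the next set.
    """
    if not 0 <= reg < n_globals + n_window_regs:
        raise ValueError("logical register out of range")
    if reg < n_globals:
        return reg                       # global registers are never remapped
    stride = n_window_regs - overlap     # new physical registers per set
    offset = reg - n_globals             # position inside the current set
    ring_size = n_windows * stride       # the sets form a circular buffer
    return n_globals + (window * stride + offset) % ring_size
```

With these assumed sizes, a caller in set 0 that writes its logical register 26 writes the same physical register that the callee in set 1 reads as its logical register 10, so a parameter is passed without any memory access.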

The third contributor to the performance of some MRS machines, fast register access, contributes to all
machines that have their register sets associated with the processor. Since this research is not concerned with
exploring the merits of such register architectures against those of memory-to-memory machines, this factor
will not be examined. Other researchers have been interested in this topic [Myers 82] and various machines
have been proposed with memory structured in novel ways [Ditzel 82, Patterson 80b].

In  the  two  sets  of  experiments  described  above,  certain  assumptions  would  have  to  be  made  about  how  the 
modified  versions  of  the  machines  are  structured.  The  hazards  involved  in  these  decisions  cannot  be  fully 
anticipated,  but  their  soundness  is  critical  for  useful  results. 

Another approach could be taken to determine the effects of MRSs and ORSs on machine performance. It
would be possible to run traces of benchmark programs that would tell how many calls were made to each
procedure and that would give a call/return profile of the programs. With this information it would be
possible to calculate the instruction cycles saved by using MRS and ORS techniques. While this approach
would answer our questions, a good simulator with all the necessary event counting mechanisms is available
to us, ISPS [Barbacci 80]. With it, simulations can be easily modified to gather any runtime statistic that might
be useful.
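
The trace-based alternative can be sketched as follows. This is a minimal illustration under a deliberately simple cost model, not the method used in this study: the function names are hypothetical, a single-register-set machine is assumed to store one register set on every call and reload it on every return, and an MRS machine is assumed to touch memory only when its register file overflows or underflows.

```python
def window_events(trace, n_windows):
    """Count register-file overflows and underflows for a call/return trace.

    `trace` is a sequence of the strings "call" and "return". The model
    assumes every one of the n_windows register sets can hold a frame.
    """
    resident = 1    # frames currently held in the register file
    spilled = 0     # older frames that have been saved to memory
    overflows = underflows = 0
    for event in trace:
        if event == "call":
            if resident == n_windows:
                overflows += 1           # oldest set must be stored
                spilled += 1
            else:
                resident += 1
        elif event == "return":
            if resident > 1:
                resident -= 1
            elif spilled > 0:
                underflows += 1          # caller's set must be reloaded
                spilled -= 1
            else:
                raise ValueError("return from the outermost procedure")
        else:
            raise ValueError("unknown trace event: %r" % (event,))
    return overflows, underflows


def accesses_saved(trace, n_windows, regs_per_window):
    """Memory accesses avoided relative to the single-register-set model."""
    overflows, underflows = window_events(trace, n_windows)
    baseline = len(trace) * regs_per_window
    with_mrs = (overflows + underflows) * regs_per_window
    return baseline - with_mrs
```

For example, `accesses_saved(["call"] * 3 + ["return"] * 3, n_windows=4, regs_per_window=16)` counts no overflows, so all 96 baseline accesses are avoided; with only two register sets the same trace overflows twice and underflows twice.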

The choice of benchmarks is an important consideration in this experiment. Benchmarks from the RISC I
project could be used with the following advantages: many results already exist, they would provide a means
of checking some of our simulations, and they are written in C, a language with many support tools at CMU.
Because they are written in C, these results would also reflect the programming biases supported by this
language. (See section 6.2.5 for more on this.) They would also produce results that would be useful to many
computer designers. However, we are ultimately interested in the performance of primitives used in the
Archons system. Benchmarks that reflect performance on these primitives are under investigation, but may
not be developed in time for use in this study.

For  these  two  sets  of  experiments,  we  will  be  using  simulations  of  at  least  the  following  machines: 

• RISC I: An initial ISPS description of this machine has been created and is being refined. Since
it already is an ORS machine, the modification experiments will involve removing its register set
overlap and giving it a single register set.

•  68000:  An  ISPS  description  for  this  processor  is  almost  complete.  A  C  compiler  exists  at  CMU 
that  could  be  modified  for  these  experiments.  Unfortunately,  the  68000’s  registers  are 
dichotomized  into  data  and  address  registers.  Care  must  be  given  to  creating  a  reasonable  MRS 
version  for  this  reason. 

• VAX: An ISPS description for the VAX already exists, as does a C compiler. It is, however, a
very complex processor and creating a valid modified version might present some problems.

We are also considering using BELLMAC-8 and Nebula [Szewerenko 81] simulations in these experiments.

Two  other  RISC  machines  of  note  are  not  being  considered  for  this  study:  the  IBM  801  [Radin.G  82]  and 
MIPS  [Hennessy  82].  No  detailed  information  on  the  801  is  available  due  to  its  proprietary  nature.  The 
MIPS  machine  presents  simulation  complexities  due  to  its  pipelined  nature  although  it  could  be  a  target  for 
later  experiments. 

6.4.3.2 Machine Support Requirements

The RISC I processor has no special instructions to help it manage its register file. It has an internal trap
mechanism that is used to detect underflows and overflows of its register file. It is possible to see this machine
as providing minimal support for its on-chip register file, truly in the spirit of RISC. It is also possible to
imagine other support mechanisms, in hardware and software, that would contribute to the management of
the register file. No experiments are proposed to analyze the possible mechanisms for such support. Instead,
it would be useful to see how the machine descriptions used in the previous experiments were modified, or
how they might have been, to support MRS machines. In general this part of the research would consist of
categorizing the various means of supporting MRSs and, perhaps, of estimating their impacts on performance
and cost.


6.4.3.3 Impact on Compiler Writing

Having MRSs in a machine has only a small effect on compiler writing. Most significantly, the
code-generation phase is changed to take advantage of better procedure linkages. Also, a system package might
have to be generated that would manage overflow/underflow traps. This code would determine the cause of
the trap and would dispatch control to the proper procedure that stores or restores register windows in the
case of an internal trap.

When  the  simulations,  which  were  described  in  the  section  on  performance  gains,  are  executed,  then  a 
working  knowledge  of  the  code  changes  necessary  will  be  developed.  Any  insight  developed  into  coding 
differences  of  significance  would  be  reported  in  this  part  of  the  research.  No  specific  experiments  or  analyses 
would  be  added. 

6.4.3.4 Usefulness of Response

With  so  much  more  state  inside  a  processor  that  has  multiple  register  sets,  the  time  required  to  store  and 
load  all  of  a  machine’s  registers  is  increased  dramatically.  The  RISC  I  architecture  goes  from  a  minimum  of 
35  architectural  32-bit  registers  of  state  to  125  when  its  full  register  file  needs  to  be  saved.  This  increase  of 
internal  state  brings  two  questions  to  mind: 

1. Does the increase in process swap time become significant in general multiprogrammed
applications or in real time environments?

2. Is there an alternative to having to store the internal state of the processor at each change of
context?

The  first  question  can  be  answered  by  finding  statistics  regarding  the  demands  of  a  variety  of  systems  (rate 
and  distribution  in  time  of  context  swaps).  It  should  be  easy,  if  these  numbers  can  be  found,  to  find  the 
situations  where  the  increased  swap  delay  is  unacceptable.  This  study  will  make  no  efforts  to  address  the 
second  question. 
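
The first question reduces to simple arithmetic once a swap rate is known. The sketch below is a rough estimate under stated assumptions, not a measurement: it assumes each context swap stores one full set of registers and loads another, one memory access per register, and it takes the memory access time and swap rate as hypothetical inputs. Only the register counts of 35 and 125 come from the text above.

```python
def swap_overhead_fraction(swaps_per_second, registers, seconds_per_access):
    """Fraction of processor time spent moving register state during swaps.

    Assumes each swap stores `registers` registers and loads `registers`
    registers, with one memory access per register.
    """
    return swaps_per_second * 2 * registers * seconds_per_access


def max_swap_rate(budget_fraction, registers, seconds_per_access):
    """Highest swap rate that keeps the overhead within budget_fraction."""
    return budget_fraction / (2 * registers * seconds_per_access)


# Hypothetical operating point: 1000 swaps per second, 400 ns per access.
ACCESS_TIME = 400e-9
minimal_state = swap_overhead_fraction(1000, 35, ACCESS_TIME)
full_register_file = swap_overhead_fraction(1000, 125, ACCESS_TIME)
```

Under these assumed figures the overhead grows from roughly 3 percent of processor time with 35 registers to roughly 10 percent with 125, in proportion to the amount of state. Whether that is acceptable depends on the swap rates actually observed in the systems of interest.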

6.4.3.5 Language Effects

Saving memory accesses by having more register state is not always possible. Due to scoping rules, as well
as indirect accesses via "pointers" as in C, some variables are not well suited for storage in MRS machines.
This leads to either special compiler changes or to special hardware mechanisms that slow such references.
The effectiveness of MRSs also depends on the characteristics of the applications to be run on the machine.
While such aspects of languages and application determine how well MRSs can be utilized, such
generalizations are not goals of this study.
7. Interim Decentralized System Testbed


7.1  Overview 

The purpose of the Archons interim testbed is to support the implementation and experimental evaluation
of algorithms for decentralized resource management and to support the development of a prototype
decentralized operating system, ArchOS, incorporating an integrated set of these algorithms.

The  long-range  plan  requires  that  Archons  project  hardware  be  specifically  designed  to  provide  support  for 
the  ArchOS  software  (see  Chapter  6).  However,  at  this  point  we  do  not  have  a  version  of  the  ArchOS 
software  on  which  to  base  any  hardware  support  requirements.  As  a  result,  we  are  constructing  an  interim 
testbed  facility  on  which  experiments  (mostly  dealing  with  software,  but  not  excluding  hardware)  can  be 
performed.  This  system  has  been  designed  to  supply  some  general  capabilities  in  order  to  support  the 
development  of  the  initial  ArchOS  operating  system  without  requiring  that  we  design  special-purpose 
hardware. 


7.2  System  Selection 

During the period from January to May 1983, we evaluated alternatives for our testbed system. Since
ArchOS is a decentralized operating system, it was decided that the interim testbed hardware should be a
collection of processing nodes interconnected by an Ethernet to form a local area network. Based on various
hardware and software considerations, including availability of compatible off-the-shelf hardware and
software, and compatibility with other research efforts, we chose the Sun Microsystems, Inc. (SMI)
Workstation as the processing node for the system.

The  Suns  fulfill  the  general  requirements  that  were  formulated  at  the  beginning  of  the  interim  testbed 
effort,  specifically: 

• Motorola 68000 processor;

• UNIX operating system;

• high-level language support;

• 10Mbit Ethernet;

• hardware expandability.

The Sun workstation uses the Motorola MC68010, a version of the 68000 with support for virtual memory
management. The 68000 is a popular processor with a large software base.


Suns are supplied with the DARPA standard Berkeley UNIX, version 4.2bsd, including system source code.
This system provides a powerful software development environment. In addition, it has extensive networking
facilities, which are useful for supporting distributed software experiments. The 4.2bsd system also runs on
the DEC VAX-11 computers in the Computer Science Department, so we can take advantage of locally
developed software, and of project members' experience with the system.

Compilers for C, Berkeley Pascal, and Fortran 77 are supplied with the Berkeley UNIX 4.2bsd system.
Both Berkeley Pascal and Fortran 77 can call C routines, and the runtime support systems for these languages
are written in C, so modifying the operating system dependent parts of the runtime support to use BBN's
CMOS system calls will make it possible to write programs to run under CMOS in any of these languages.
CMOS is a simple operating system kernel written in C that provides low-level support for multiple processes,
interprocess communication/coordination, asynchronous I/O, memory allocation, and system clock
management.

The  10Mbit  Ethernet  is  a  standard  high-speed  inter-node  communications  medium.  The  TCP/IP  Ethernet 
software  supplied  with  4.2bsd  will  allow  Ethernet  file  transfers  between  the  testbed  system  and  the  Computer 
Science  Department’s  machines.  It  also  supports  network  virtual  disks,  making  it  possible  for  a  single  disk 
server  to  support  a  number  of  diskless  workstations. 

Since the Multibus is used as the system bus, the Sun workstation is expandable. We may easily acquire
off-the-shelf hardware or build new boards which can be added to the system; also, since Multibus supports
multiple bus masters, we have the capability to add a second CPU card in a single workstation, thereby more
closely approximating the hardware of the final Archons testbed facility.

The second major reason for selecting the SMI hardware was to facilitate the sharing of software with other
experimenters using similar development systems. In particular, we are interested in cooperating with the
work being carried out by Bolt, Beranek and Newman in the area of distributed operating systems. We are
beginning to examine their C70/UNIX Distributed Operating System to determine how it may be moved into
the Archons interim testbed system environment. The first step will be to examine BBN UNIX and compare
it to Berkeley UNIX 4.2bsd.

We anticipate that some experiments will not require and might be hindered by the presence of a large and
complex operating system. For this reason, we intend to provide an intermediate level of support between the
bare machine and the full 4.2bsd system. For this purpose, we currently plan to use the BBN CMOS system.
CMOS is an open operating system kernel in the sense that there are no security barriers between the OS and
the user program. This feature gives us full flexibility for low-level software, while providing a minimum
level of system services to programs that need them.

We also recognize a need for performance measurement tools in both 4.2bsd and CMOS. Aside from a
simple execution profiler included in UNIX, we plan to investigate systems that are better suited for
distributed performance monitoring or debugging. One possibility that seems promising is a distributed
monitoring system developed at CMU for the Cm* project [Snodgrass 82].

7.3  Current  Status  and  Future  Plan 

We are beginning the integration work required to construct the Archons interim testbed facility. This
work is being carried out on three Sun Workstations that have been loaned to us until the hardware to be
purchased specifically for our work arrives. We have connected the interim testbed system with the CMU
CSD's Ethernet cable so that the testbed system can interact with the rest of the CMU CSD computing
facilities. The most important ongoing tasks are: to learn how new hardware can be added to the system (in
particular, to learn how to write device drivers for UNIX 4.2bsd); and to bring up a small, stand-alone
operating system kernel to be used for low-level operating system experimentation.

One unresolved issue is how changes in the execution environment can be handled most efficiently.
Although it is certainly possible to reboot the hardware with a different environment (such as the UNIX
operating system, the C70/UNIX DOS, or an ArchOS standalone experimental environment) each time a
change is desired, we hope to avoid such an inconvenient approach. But, if we can't avoid this approach, then
we must make the method as convenient as we can. For instance, it may be feasible for some processing
nodes to be used for program development while others are running experiments.

As we move on to our search for appropriate hardware architectures for a decentralized computer system, it
may be possible to use our interim testbed to simulate alternative hardware configurations. For example,
routines can be written that will make it appear that the nodes are connected by several buses, allowing
experimentation with ArchOS handling of individual "bus failures and recovery". We could also insert a
context swapping mechanism that would give the appearance of several processors at each node to allow us to
test that ArchOS can actually tolerate OS concurrency, even at an individual node. Our initial ArchOS will
actually execute directly on the interim testbed hardware, but our work on experiment control and monitoring
tools will be done with the objective of being transportable to a simulated hardware situation.
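
The idea of routines that make a single physical network appear to be several buses can be sketched as below. This is a hypothetical illustration only; the class and method names are not part of ArchOS or of the testbed software, and the policy of always using the lowest-numbered working bus is an assumption made to keep the example short.

```python
class SimulatedBuses:
    """Present one physical network as several independently failing buses."""

    def __init__(self, n_buses):
        if n_buses < 1:
            raise ValueError("at least one simulated bus is required")
        self.up = [True] * n_buses   # True while a simulated bus is working
        self.delivered = []          # log of (bus, source, destination, payload)

    def fail(self, bus):
        """Inject a failure of one simulated bus."""
        self.up[bus] = False

    def recover(self, bus):
        """Bring a failed simulated bus back into service."""
        self.up[bus] = True

    def send(self, source, destination, payload):
        """Deliver a message over the lowest-numbered working bus."""
        for bus, working in enumerate(self.up):
            if working:
                self.delivered.append((bus, source, destination, payload))
                return bus
        raise ConnectionError("no simulated bus is available")
```

An experiment could call `fail(0)` between two sends and check that the second message travels over bus 1, then call `recover(0)` to exercise the recovery path.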

A. Annotated Bibliographies


A.1.  Decentralized  Operating  Systems 

[Aiso  75]  Aiso,  H.;  Tokuda,  H.;  Ishizuka,  A.;  Kamibayashi,  N.;  Takeyama,  A. 

The  System  Software  for  KOCOS. 

In  Proceedings  of  the  IFIP  TC-2  Working  Conference  on  Software  for 
Minicomputers,  IFIP,  September,  1975. 

Abstract 

This is a study on the system software of the minicomputer complex, KOCOS (Keio-Oki's Complex System). The
purpose of this system is to first realize resource and load sharing in the heterogeneous minicomputer complex.
Finally, the purpose is to realize parallel processing through organic integration of resources. This system is
characterized by the following two points: first, the system software is composed of two modules. One, called
System Scheduler, controls all the static system resources in a centralized manner, and the other, called Local
Operating System, distributively takes care of the execution of processes on each minicomputer. Secondly, the
interprocess communication facility has been realized through positive utilization of microprocessors and is rich
both in flexibility and expandability. This paper outlines the system configuration, structure of System Scheduler
and Local Operating System, and the interprocess communication facility.

[Allchin 83]  James E. Allchin and Martin S. McKendry.

Support for Objects and Actions in Clouds.

Technical Report GIT-ICS-83/11, Georgia Institute of Technology, May, 1983.

Abstract 

This status report describes the current work of the Clouds project at Georgia Tech. The Clouds project is studying
techniques for construction of reliable computing systems in environments of distributed machines interconnected
by local area networks. This report emphasises the functional requirements for architectural support. To support
reliability, the architecture supports objects and actions. Objects are instances of abstract data types. They
provide a basis for building system components and for controlling the behavior of a system when failures occur.
Atomic actions are a means of dynamically grouping invocations of operations on objects into units of work that
either complete in their entirety or do not have any effect whatsoever. Recovery mechanisms assist in maintaining
this abstraction and synchronization mechanisms control interactions between actions.

[Almes 83]  Almes, G. T.

Integration  and  Distribution  in  the  Eden  System. 

In IEEE International Workshop on Computer Systems Organization (New Orleans
LA), pages 62-71. IEEE, March 29-31, 1983.

Abstract 

Although locally distributed computer systems are becoming increasingly common and attractive, operating
systems designers have paid little attention to the special needs and opportunities of these systems. The Eden
project is one of the few attempts to design an operating system appropriate to these needs and opportunities.
This paper describes the approach taken by the Eden project in designing a system both appropriate to a specific
class of computer systems and supportive of a modern software distributed hardware base and a logically
integrated operating system.

[Andre  82]  Andre,  J.  P.;  Petit,  J.  C.;  Derriennic-Le  Corre,  H. 

Dynamic Software Reconfiguration in a Distributed System (Galaxie).

In IEEE International Conference on Communications. ICC '82: The Digital
Revolution (Philadelphia PA), pages 5G.4.1-5G.4.5. IEEE, June 13-17, 1982.

Abstract 

Distributed architectures are becoming very attractive in building complex software systems, such as control for
switching systems. To obtain the benefit of the inherent advantages of such architectures, e.g. graceful
degradation, extensibility and adaptability, basic concepts of distribution in operating systems must be specified
and experiments performed. This paper deals with a system (Galaxie) aiming at an experimental implementation of
these concepts, mainly in the fields of dynamic software allocation. Moreover, in order to provide levels of
abstraction with regard to the organization of the underlying hardware and network architecture, the authors
present a modular and hierarchical operating system model.


[Applewhite 82]  Applewhite, Hugh L.; Garg, Roli; Jensen, E. Douglas; Northcutt, J. Duane; Sha, Lui.

Decentralized Resource Management in Distributed Computer Systems.

Technical Report RADC-TR-81-203, Rome Air Development Center, Griffiss AFB,
NY, February, 1982.

Abstract 

This is the first technical report from the Archons project, which is performing research in the science and
engineering of 'distributed computers'. By this we mean a computer having a highly decentralized (e.g.,
consensus) resource management at every level of abstraction from the executive down. This report provides a
snapshot of several incomplete, ongoing investigations: decentralized synchronization; the requirements for
simulation of decentralized resource management algorithms; and the facilities to be provided by a decentralized
executive. We begin with a summary of our views on decentralized resource management and control, and the
implications of physical communications on control (especially at the executive level). Then we briefly survey
several other distributed system projects. This brings the Archons project into closer focus, as their orientations
and objectives are considerably different from ours. Synchronization (the induction of a common, consistent
ordering on events) is the essence of decentralized control. New concepts and techniques are required to achieve
synchronization in distributed computers without reliance on any centralized entity such as a semaphore, monitor,
sequencer, or bus arbiter.

[Ayache  82a]  Ayache,  J.  M.;  Courtiat,  J.  P.;  Diaz,  M.;  Michelena,  J. 

Software and Protocols in REBUS, A Distributed Real-Time Control System.

In Software for Computer Control 1982, Proceedings of the Third IFAC/IFIP
Symposium (Madrid, Spain), pages 147-153. IFAC/IFIP, October 5-8, 1982.

Abstract 

REBUS is a robust and fault tolerant cooperation system for a local real time control microcomputer network. It is
being developed at the LAAS in connection with the industrial real time control system MODUMAT 800 of
Schlumberger-Europe. Based on a general hardware architecture, the design of REBUS emphasizes the aspects
of cooperation and fault tolerance as required in local real time control networks and it is primarily concerned with
the problems of specification, validation, and implementation of some standard and specific protocols. After a
short presentation of the hardware architecture, the various software levels are described; they include the
operating system kernel of the processors, the line, network and transport layers, and the remote call mechanism.
Finally, a tool, the observer developed for protocol debugging and measure purposes is also presented.

[Ayache 82b]  Ayache, J. M.; Courtiat, J. P.; Diaz, M.

REBUS,  A  Fault-Tolerant  Distributed  System  for  Industrial  Real-Time  Control. 

IEEE  Transactions  on  Computing  C-31(7):637-647,  July,  1982. 

Abstract 

Presents  a  fault-tolerant  distributed  system  designed  for  real-time  control  applications  (REBUS),  which  is  one  of 
the  research  basis  of  the  industrial  real-time  system  MODUMAT  800.  It  is  made  up  of  functional  units,  i.e. 
programmable  multiloop  regulators  and  operator  displays,  linked  together  by  a  communication  structure.  The 
communication  hardware  consists  of  a  set  of  serial  bus  interface  boards,  one  per  functional  unit,  loosely  coupled 
together  by  a  double  serial  bus  and  linked  to  their  functional  units  by  a  private  parallel  bus.  The  communication 
software,  implemented  on  each  interface  board,  provides  a  distributed  executive  based  on  a  reliable  link  protocol 
and  a  robust  bus  allocation  mechanism.  Different  fault-tolerant  mechanisms  are  implemented  in  order  to  achieve 
the  dependability  requirements  of  industrial  control  systems. 

[Ball 76]  Ball, J. E.; Feldman, J.; Low, J. R.; Rashid, R.; Rovner, P.

RIG, Rochester's Intelligent Gateway: System Overview.

IEEE Transactions on Software Engineering SE-2(4):321-328, December, 1976.

Abstract 

Rochester's Intelligent Gateway (RIG) system provides convenient access to a wide range of computing facilities.
The system includes five large minicomputers in a very fast internal network, disk and tape storage, a
printer/plotter and a number of display terminals. These are connected to larger campus machines (IBM 360/65
and DEC KL10) and to the ARPANET. The operating system and other software support for such a system present
some interesting design problems. This paper contains a high-level technical discussion of the software designs,
many of which will be treated in more detail in subsequent reports.

[Ball 82]  Ball, J. E.; Barbacci, M. R.; Fahlman, S. E.; Harbison, S. P.; Hibbard, P. G.; Rashid,
R. F.; Robertson, G. G.; Steele, G. L. Jr.

The Spice Project.

Technical Report, Computer Science Research Review, Carnegie-Mellon
University, 1982.

Abstract 

The long-range aim of the Spice project is to create a departmental personal computing environment that will be
usable through the 1990's. Development and hardware acquisition will be spread over about four years, so that by
1985 most departmental research and ordinary computing will be performed on personal computers. The
computers will be connected together by one or more high bandwidth local networks, to access each other and to
access central services such as printers and file servers. Gateways will be available to other networks, such as the
ARPAnet. Specialized devices will be attached to the network for particular projects. Also available will be the
present timesharing facilities, which will survive at least through a transition period.

[Bane 81]  Bane, R.; Stanfill, C.; Weiser, M.

Operating System Strategy on ZMOB.

In 1981 IEEE Computer Society Workshop on Computer Architecture for Pattern
Analysis and Image Database Management (Hot Springs VA), pages 125-132.
IEEE, November 11-13, 1981.

Abstract 

The ZMOB multiprocessor computer will use a distributed operating system with a host controller. The operating
system, called MOBIX, gives to the user the image of using an ordinary UNIX system but with truly parallel process
execution. Individual ZMOB processors can communicate directly with each other, but hard system calls and
references to global names are referred to the host for action. The interprocess communication protocols are
sufficiently general to allow many kinds of programs, including both synchronous and asynchronous applications.

[Baskett  77]  Baskett,  F.;  Howard,  J.  H.;  Montague,  J.  T. 

Task  Communication  in  DEMOS. 

In Proceedings of the Sixth ACM Symposium on Operating Systems
Principles, pages 23-31. ACM, November, 1977.

Abstract 

This  paper  describes  the  fundamentals  and  some  of  the  details  of  task  communication  in  DEMOS,  the  operating 
system  for  the  CRAY-1  computer  being  developed  at  the  Los  Alamos  Scientific  Laboratory.  The  communication 
mechanism  is  a  message  system  with  several  novel  features.  Messages  are  sent  from  one  task  to  another  over 
links.  Links  are  the  primary  protected  objects  in  the  system;  they  provide  both  message  paths  and  optional  data 
sharing  between  tasks.  They  can  be  used  to  represent  other  objects  with  capability-like  access  controls.  Links 
point  to  the  tasks  that  created  them.  A  task  that  creates  a  link  determines  its  contents  and  possibly  restricts  its  use. 
A  link  may  be  passed  from  one  task  to  another  along  with  a  message  sent  over  some  other  link  subject  to  the 
restrictions  imposed  by  the  creator  of  the  link  being  passed.  The  link  based  message  and  data  sharing  system  is  an 
attractive  alternative  to  the  semaphore  or  monitor  type  of  shared  variable  based  operating  system  on  machines 
with  only  very  simple  memory  protection  mechanisms  or  on  machines  connected  together  in  a  network. 

[Berg  82]  Berg,  H.  K.;  Smith,  M.  G. 

A  Distributed  System  Experimentation  Facility. 

In  Proceedings  of  the  3rd  International  Conference  on  Distributed  Computing 
Systems  (Miami/Fort  Lauderdale,  FL),  pages  324-329.  IEEE,  October  18-22, 
1982. 

Abstract 

Describes  the  Distributed  System  Testbed  (DTS)  developed  at  the  Honeywell  Corporate  Computer  Sciences 
Center.  The  motivations  for  the  use  of  experimentation  facilities  in  distributed  processing  research  are  recalled, 
and  design  of  DST  are  summarized.  The  concepts  which  are  realized  by  DST  are  summarized.  The  concepts 
which  are  realized  by  DST  are  discussed  with  emphasis  on  the  instrumentation  facilities  and  experiment  control. 
Both  the  system  hardware  and  the  system  software  are  described.  The  discussion  of  the  system  hardware 
highlights  the  node  hardware,  the  interconnection  hardware  and  the  experiment  timing  affordable  by  these 
components.  The  discussion  of  the  system  software  concentrates  on  the  structure  and  concepts  of  the  operating 
system  kernel  and  the  applicability  of  the  kernel  primitives  to  experimentation  with  and  instrumentation  of  the 
testbed.

[Bernstein 79] P. A. Bernstein, D. W. Shipman and J. V. Rothnie, Jr.

Concurrency Control in SDD-1: A System for Distributed Databases: Part I:
Description and Part II: Analysis of Correctness.

Technical Report CCA-03-79 and CCA-04-79, Computer Corporation of America
Technical Reports, January, 1979.

[Bevan  80]  Bevan,  S.  J. 

A Preliminary Implementation of POSER.

Technical Report DRIC-BR-76603, Defence Research Information Centre,
Orpington, England, September, 1980.

Abstract

POSER is a process organisation to simplify error recovery intended for use in fault tolerant, distributed computer
systems running real-time programs. This memorandum describes the process organisation used in POSER and
how the organisation has been experimentally implemented in a multi-computer simulation. Application program
design has been studied by producing a large radar tracking program which runs on the POSER simulation. A
version of the radar program exists in MASCOT and some comparisons of the two complete programs have been
made. Finally, some broad comparisons of the MASCOT and POSER methods are made.

[Birrell 82] Birrell, A. D.; Levin, R.; Needham, R. M.; Schroeder, M. D.

Grapevine:  An  Exercise  in  Distributed  Computing. 

Communications  of  the  ACM  25(4):260-274,  April,  1982. 

Abstract 

GRAPEVINE is a multicomputer system on the Xerox research internet. It provides facilities for the delivery of
digital messages such as computer mail; for naming people, machines, and services; for authenticating people and
machines;  and  for  locating  services  on  the  internet.  This  paper  has  two  goals:  to  describe  the  system  itself  and  to 
serve  as  a  case  study  of  a  real  application  of  distributed  computing.  Part  I  describes  the  set  of  services  provided  by 
GRAPEVINE  and  how  its  data  and  function  are  divided  among  computers  on  the  internet.  Part  II  presents  in  more 
detail  selected  aspects  of  GRAPEVINE  that  illustrate  novel  facilities  or  implementation  techniques,  or  that  provide 
insight  into  the  structure  of  a  distributed  system.  Part  III  summarizes  the  current  state  of  the  system  and  the 
lessons  learned  from  it  so  far. 

[Blair 82] Blair, G. S.; Hutchison, D.; Shepherd, W. D.

MIMAS - A Network Operating System for Strathnet.

In Proceedings of the 3rd International Conference on Distributed Computing
Systems (Miami/Fort Lauderdale, FL), pages 212-217. IEEE, October 18-22,
1982.

Abstract 

Recent  technological  advances  and  developments  in  user  requirements  have  led  to  the  recognition  of  a  new 
branch  of  computer  science,  that  of  distributed  systems.  A  great  deal  of  research  is  required  before  their  potential 
benefits  can  be  fully  realised.  At  Strathclyde  University,  research  into  distributed  systems  has  followed  a  bottom-up 
layered  approach.  The  first  stage  was  the  design  of  an  Ethernet-like  local  area  network  called  STRATHNET.  This 
was  followed  by  the  development  of  an  interprocess  communication  service  employing  the  notion  of  a  port  which 
provides  a  testbed  for  experimentation  into  distributed  operating  systems  design.  The  distributed  operating  system 
will primarily integrate a number of departmental PDP-11's running the UNIX operating system and will reside in a
series of layers above the UNIX kernel. The main design criteria for the system are ease of incremental growth,
high  availability  and  reliability.  This  paper  outlines  the  design  of  the  MIMAS  network  operating  system. 

[Boebert  78a]  Boebert,  W.  E.;  Franta,  W.  R.;  Jensen,  E.  D.;  Kain,  R.  Y. 

Decentralized  Executive  Control  in  Distributed  Computer  Systems. 

In Proceedings of COMPCON 78, pages 254-258. IEEE, November, 1978.

Abstract 

This  paper  discusses  the  issues  involved  in  building  a  real-time  control  system  using  a  message-directed 
distributed  architecture.  We  begin  with  a  discussion  of  the  nature  of  real-time  software,  including  the  viability  of 
using  hierarchical  models  to  organize  the  software.  Next  we  discuss  some  realistic  design  objectives  for  a 
distributed real-time system including fault isolation, independent module verification, context-independence,
decentralized  control  and  partitioned  system  state.  We  conclude  with  some  observations  concerning  the  general 
nature of distributed system software.

[Boebert 78b] Boebert, W. E.; Franta, W. R.; Jensen, E. D.; Kain, R. Y.

Kernel Primitives of the HXDP Executive.

In Proceedings of COMPCON 78, pages 595-600. IEEE, November, 1978.

Abstract 

This paper describes the kernel of an Executive being implemented for the Honeywell Experimental Distributed
Processor (HXDP), a vehicle for research in distributed computers for real-time control. The kernel provides
message  transmission  primitives  for  use  by  application  programs  or  higher  level  executive  functions.  In  the  paper 
we  describe  the  message  transmission  primitives  provided  by  the  kernel  and  the  rationale  for  their  selection  based 
upon  the  objectives  and  constraints  described  in  a  companion  paper. 

[Boebert  XX]  W.  E.  Boebert,  D.  Cornhill,  W.  R.  Franta,  and  E.  D.  Jensen. 

Communications  in  the  HXDP  Executive. 

IEEE  Transactions  on  Software  Engineering,  19XX. 
to  appear. 

[Boyd  83]  Boyd,  R.  T.;  Dickerson,  K.  R.;  Sager,  J.  C. 

A  Distributed  Operating  System  for  Reliable  Telecommunications  Control. 

In  Fifth  International  Conference  on  Software  Engineering  for  Telecommunication 
Switching  Systems  (Lund,  Sweden),  pages  190-195.  IEE,  July  4-8, 1983. 

Abstract 

The  system  consists  of  a  number  of  loosely-coupled  processor  modules  attached  to  external  hardware.  The 
software  is  composed  of  communicating  processes.  A  key  design  feature  is  system-wide  reconfigurability  under 
operating  system  control.  This  means  that  processes  can  be  allocated  to,  and  migrated  between,  processor 
modules  as  required;  for  example,  following  module  hardware  failures,  or  for  changing  workload  requirements. 
Sections  outline  the  processor  system  architecture  and  operating  system  control  of  communication,  configuration 
management  and  fault  recovery. 

[Bruins  83]  Bruins,  Th.;  Vree,  W.;  Reijns,  G.;  van  Spronsen,  C. 

A  Layered  Distributed  Operating  System. 

In Local Networks. Strategy and Systems. LOCALNET '83 (London,
England), pages 351-371. March 8-10, 1983.

Abstract 

The  rapidly  decreasing  prices  and  increasing  performance  of  micro  electronics,  together  with  the  promising 
developments  in  the  area  of  digital  transmission  permit  a  new  approach  in  distributed  computing  architecture.  The 
basic  aim  was  to  allow  a  number  of  micro  processors  to  achieve  a  common  task,  behaving  toward  the  user  as  one 
abstract  machine.  In  order  to  avoid  vulnerable  or  critical  elements,  distribution  of  tasks  has  been  accomplished  in 
such a manner that elimination of a processor only degrades but never stops the total service. Special attention is
given  to  the  principles  of  true  distributed  and  parallel  processing  and  the  consequences  for  the  operating  system 
services.  Emphasis  has  been  put  on  the  description  of  functions  in  higher  layers  such  as  the  call  of  not  locally 
available  functions,  the  distributed  directories  and  the  way  they  are  incorporated  and  updated.  Furthermore,  a 
description is given of the way connection-less data communications has been facilitated by incorporating a
storage  function  at  the  transport  level. 

[Carulli  82]  Carulli,  M.;  Murro,  O. 

Software  Architecture  of  a  Locally  Distributed  System  Supporting  Network 
Transparent  Applications. 

In  Wescon/82  Conference  Record  (Anaheim  CA),  pages  24-32.  Electronics 
Conventions,  September  14-16, 1982. 

Abstract 

Presents  an  integrated,  distributed  system  based  on  an  Ethernet  network  of  the  Olivetti  BCS  2000  system.  A 
fundamental  objective  of  this  system  is  to  develop  distributed  applications  at  the  same  level  of  difficulty  as  in 
individual  machines.  The  authors  present,  in  particular,  the  architecture  of  the  BCOS-M  distributed  operating 
system,  the  design  of  which  is  determined  by  the  objectives  of  network  transparency  as  well  as  by  the  needs  of 
resource  distribution,  reliability  and  availability. 


[Cheriton  79]  D.  R.  Cheriton,  M.  A.  Malcolm,  et  al. 

Thoth,  a  Portable  Real-Time  Operating  System. 

Communications of the ACM 22(2):105-115, February, 1979.

Abstract 

This  paper  describes  a  portable  real-time  operating  system  called  Thoth  which  has  been  developed  at  the 
University of Waterloo as part of a research study into the feasibility of portable operating systems. Thoth
supports multiple processes, dynamic memory allocation, device-independent input/output, a file system, multiple
terminals, and swapping. It is currently running on two minicomputers with quite different architectures (Texas
Instruments  990  and  Data  General  Nova). 

[Coleman  79]  Coleman,  Aaron  Ray. 

Security  Kernel  Design  for  a  Microprocessor-Based  Multilevel  Archival  Storage 
System. 

Master's  thesis,  Naval  Postgraduate  School,  Monterey,  CA,  December,  1979. 

Abstract 

This  thesis  is  a  detailed  design  of  a  security  kernel  for  an  archival  file  storage  system.  Microprocessor  technology 
is  used  to  address  a  major  part  of  the  problem  of  information  security  in  a  distributed  computer  system.  Utilizing 
multi-programming techniques for processor efficiency, segmentation for controlled sharing, and a loop-free
structure for avoiding intermodule dependencies, the Archival Storage is designed for implementation on the
Zilog Z8001 microprocessor with a memory management unit. The concepts of a process structure and a
distributed  kernel  are  used  in  providing  management  of  the  shared  hardware  resources  of  the  system.  The 
security  kernel  primitives  create  a  virtual  machine  environment  and  provide  information  security  in  accordance 
with  a  non-discretionary  security  policy. 

[Cornhill 79] Cornhill, D. T.; Boebert, W. E.

Implementation  of  the  HXDP  Executive. 

In  Proceedings  of  COMPCON  79,  pages  219-221.  IEEE,  February,  1979. 

Abstract 

This paper describes a first implementation of the executive for the Honeywell Experimental Distributed Processor
(HXDP). HXDP has been built to investigate distributed, decentralized control in real time applications. The
purpose  of  the  implementation  is  to  demonstrate  the  utility  of,  and  to  gain  experience  with  the  executive  primitives 
in  the  area  of  interprocess  communication. 

[Czaplicki 81] Czaplicki, C. S.

Advanced  Airborne  Executive. 

In  Sixth  Conference  on  Local  Computer  Networks  (Minneapolis  MN),  pages  10-12. 
IEEE,  October  12-14, 1981. 

Abstract 

The  main  objective  of  this  program  was  to  postulate,  implement  and  test  a  distributed  executive  design  which 
would  meet  the  requirements  of  various  avionics  distributed  processing  configurations.  Future  project 
requirements  are  reviewed  and  a  distributed  processing  architecture  which  best  meets  the  near-term  future  Navy 
avionic  requirements  has  been  selected.  The  goal  was  a  general  purpose  executive  program  which  would  provide 
increased  reliability,  graceful  degradation  and  expanded  processing  capability  while  providing  flexibility  in 
architectural  design  of  the  configuration  of  computers  and  processing  functions  within  a  system. 

[Finkel 80] Finkel, Raphael; Solomon, Marvin; Tischler, Ron.

Arachne  User  Guide,  Version  1.2. 

Technical  Report  MRC-TSR-2066,  Mathematics  Research  Center,  Wisconsin 
University, Madison, WI, April, 1980.

Abstract 

Arachne  is  a  multi-computer  operating  system  running  on  a  network  of  LSI-11  computers  at  the  University  of 
Wisconsin.  This  document  describes  Arachne  from  the  viewpoint  of  a  user  or  a  writer  of  user-level  programs.  All 
system  service  calls  and  library  routines  are  described  in  detail.  In  addition,  the  command-line  interpreter  and 
terminal  input  conventions  are  discussed.  Companion  reports  describe  the  purpose  and  concepts  underlying  the 
Arachne  project  and  give  detailed  accounts  of  the  Arachne  utility  kernel  and  utility  processes. 


[Friedrich  83]  Friedrich,  G.  R.;  Eser,  F.  W. 

Management  Units  and  Interprocess  Communication  of  DINOS. 

Siemens Forsch.- und Entwicklungsber. (Germany) 12(1):21-27, January, 1983.

Abstract 

The  structure  and  implementation  aspects  of  the  processing  management  and  the  interprocess  communication 
(IPC)  offered  by  DINOS  are  described.  Hierarchically  structured  software  units  (execution  unit,  distribution  unit, 
process)  are  the  basic  objects  of  the  processing  management.  All  the  software  is  distributed  over  a  number  of 
independent  execution  units  consisting  of  a  variable  number  of  distribution  units.  The  processing  management 
allocates  these  software  units  in  a  distributed  and  decentralized  way  at  runtime.  The  interprocess  communication 
is  based  on  messages.  Hierarchical  names  ensure  independence  of  the  IPC  interface  from  the  process  allocation 
since  IPC  data  are  strictly  partitioned  and  distributed.  IPC  control  and  data  are  completely  distributed. 

[Fundis  80]  Fundis,  Roxanna;  Wallentine,  Virgil. 

Command  Processors  for  Dynamic  Control  of  Software  Configurations. 

Technical  Report  TR-80-02,  Department  of  Computer  Science,  Kansas  State 
University,  Manhattan,  KA,  July,  1980. 

Abstract 

Command language facilities for the construction and execution of software configurations - networks of
communicating processes - are very limited today because current operating systems do not support this level of
complexity. The Network Adaptable Executive (NADEX) is an operating system which was designed to support
dynamic configurations - those configurations which are constructed at command interpretation time - of
cooperating processes. These dynamic configurations include arbitrary graphs which may contain cycles. Three
command  processors  have  been  developed  to  demonstrate  the  sufficiency  of  the  NADEX  facilities  to  support 
dynamic  configurations.  NADEX  facilities,  an  overview  of  the  Job  Control  System,  and  the  command  processor 
configuration  environment  are  presented,  followed  by  user's  guides  for  the  command  processors.  Each  command 
processor  has  different  responsibilities  and  capabilities  for  handling  configurations.  The  NADEX  Static  command 
processor  executes  completely  connected  configurations.  The  UNIX  command  processor  allows  linear 
configurations  to  be  constructed  dynamically,  and  the  MIRACLE  command  processor  allows  the  dynamic 
construction  of  arbitrary  configurations.  Syntax  graphs  and  sample  user  sessions  are  presented  for  each 
command  processor. 

[Gatefait 81] Gatefait, J. P.; Surleau, P.; Konrat, J. L.

Execution  Mechanisms  for  Administration  Programs  in  the  E10.S  System. 

In IEE Fourth International Conference on Software Engineering for
Telecommunication Switching Systems (Coventry, England), pages 130-137.
IEE, July 20-24, 1981.

Abstract 

Addresses some of the specific O&M software problems encountered with a distributed control system like the
E10.S,  and  the  solutions  adopted.  The  authors  successively  discuss:  the  system's  distributed  control  architecture; 
the  role  of  the  Operator  Command  Servicing  (OCS)  programs;  some  aspects  of  man-machine  communications  and 
OCS  program  execution;  mechanisms  for  access  to  system  data,  and  use  of  a  logical  model  to  provide  uniform 
descriptions  for  all  data  accessible  by  operators. 

[Geitz 81] Geitz, G. W.; Schmitter, E. J.

BFS - Realization of a Fault-Tolerant Architecture.

In  Eighth  Annual  Symposium  on  Computer  Architecture  (Minneapolis  MN),  pages 
163-170.  IEEE,  ACM,  May  12-14, 1981. 

Abstract 

Considers  possibilities  of  distributed  architecture  to  improve  the  reliability  of  microcomputer  systems  to  realize  a 
fault-tolerant system. By using and extending existing redundancies of hardware, software, and time, a partially
meshed ring structure that meets the requirements of a fault-tolerant architecture has been designed. Aspects of
hardware  implementation,  system  software  structure,  operating  system  requirements,  fault  diagnosis,  and 
reconfiguration  are  explained,  based  on  the  fault-tolerant  architecture  Basic  Fault-tolerant  System  BFS. 


[Glorieux  81]  Glorieux,  A.  M.;  Rolin,  P.;  Sedillot,  S. 

User  Services  Offered  by  the  Application  Protocol  Implemented  in  SIRIUS-DELTA. 

In Networks from the User's Point of View. Proceedings of the IFIP TC-6 Working
Conference COMNET '81 (Budapest, Hungary), pages 107-115. IFIP, May 11-15,
1981.

Abstract 

Over the years the need for handling distributed applications has increased tremendously. The authors describe
the goals and the architecture of the distributed data base management system SIRIUS-DELTA. The attribution of
each layer and the protocols are discussed. The user point of view guides the authors in all these definitions.
Issues  in  query  decomposition,  concurrency  control,  failure  survival,  distributed  executive,  checkpoints  and 
performances  evaluations  are  studied. 

[Guillemont  82]  Guillemont,  M. 

The  CHORUS  Distributed  Operating  System:  Design  and  implementation. 

In Local Computer Networks. Proceedings of the IFIP TC 6 International In-Depth
Symposium  on  Local  Computer  Networks  (Florence,  Italy),  pages  207-223.  IFIP, 
April  19-21, 1982. 

Abstract 

CHORUS is an architecture for distributed systems. It includes a method for designing a distributed application, a
structure for its execution and the (operating) system to support this execution. One important characteristic of
CHORUS  is  that  the  major  part  of  the  system  is  built  with  the  same  architecture  as  applications.  In  particular,  the 
exchange  of  messages,  which  is  the  fundamental  communication/synchronization  mechanism,  has  been 
extended  to  the  most  basic  functions  of  the  system. 

[Heger 81] Heger, Dirk.

Completion and Pilot Testing of a Fault Tolerant Real Time Computer System with
Distributed  Microcomputers:  Pilot  Implementation  (Really  Distributed  Control 
(RDC)  System). 

Technical  Report  BMFT-FB-DV-81-007,  Bundesministerium  fuer  Forschung  und 
Technologie,  Bonn-Bad  Godesberg,  Germany,  December,  1981. 

Abstract 

A  prototype  RDC  system  was  tested  and  the  completeness  of  the  hardware  and  software  components  were  proven 
in  practice  by  a  pilot  implementation.  Features  of  the  system  include:  distributed  fault  tolerant  real  time  computer 
system  with  a  fiber  optic  ring-bus  system  for  industrial  automation;  modular  design,  central  operating  by  means  of 
an input-output color screen system; and complete programming by means of a multicomputer PEARL. A stepwise
upgraded pit furnace plant with 28 pit furnaces was selected as the pilot project. Experience was gained in the
following areas: reliability; fault diagnosis; fault tolerance; fiber optics under environmental stress; traffic flow in the
ring-bus system with decentralized control; digital drive and control of a real pit furnace process using the high
level language PEARL; synchronization and interprocess communication with PEARL; use of a dynamic down line
loader; application of a distributed operating system supporting multicomputer PEARL; adaptation of the PEARL
operating  system  to  other  computers;  and  distributed  real  time  data  bases. 

[Hsia  79]  P.  Hsia. 

A  Configurable  Distributed  Computing  System. 

In Proceedings of the First International Conference on Distributed Computing
Systems,  IEEE,  November,  1979. 

[Jensen  78]  Jensen,  E.  D. 

The  Honeywell  Experimental  Distributed  Processor  -  An  Overview. 

IEEE  Computer  11(1):28-38,  January,  1978. 

Abstract 

The Honeywell Experimental Distributed Processor (HXDP) is a vehicle for research in the science and
engineering of processor interconnection, executive control, and user software for a certain class of
multiple-processor computers which we call 'distributed computer' systems. Such systems are very unconventional in that
they accomplish total system-wide executive control in the absence of any centralized procedure, data, or
hardware. The primary benefits sought by this research are improvements over more conventional architectures
(such as multiprocessors and computer networks) in extensibility, integrity, and performance. A fundamental thesis
of  the  HXDP  project  is  that  the  benefits  and  cost-effectiveness  of  distributed  computer  systems  depend  on  the 
judicious  use  of  hardware  to  control  software  costs. 

[Jensen 81] E. Douglas Jensen.

Distributed Control.

In  B.  W.  Lampson,  M.  Paul,  and  H.  J.  Siegert,  Distributed  Systems  -  Architecture 
and  Implementation,  pages  175-190.  Springer-Verlag,  1981. 

[Jessop 82] Jessop, W. H.; Noe, J. D.; Jacobson, D. M.; Baer, J. L.; Pu, C.

The Eden Transaction-Based File System.

In  Proceedings  of  the  Second  Symposium  on  Reliability  in  Distributed  Software  and 
Database  Systems  (Pittsburgh  PA),  pages  163-169.  IEEE,  July  19-21,  1982. 

Abstract 

The Eden file system employs an object model approach in the design of a transaction-based file system to be
used in the Eden distributed system. The file system relies on a kernel which provides both an object model
abstraction and a relatively high-level storage system. The Eden file system will provide all of the functions of a
conventional file system. In addition, it will serve as a research tool-kit, both for developing distributed applications
which depend on a general transaction mechanism and for research into the performance of different concurrency
control  methods  which  can  be  used  within  the  transaction  mechanism. 

[Jones 79] A. K. Jones, R. J. Chansler, Jr., I. Durham, K. Schwan, and S. R. Vegdahl.

StarOS,  a  Multiprocessor  Operating  System  for  the  Support  of  Task  Forces. 

In  Proceedings  of  the  Symposium  on  Operating  Systems  Principles,  ACM, 
December,  1979. 

[Karshmer  83]  Karshmer,  A.  I.;  Phelan,  J.;  Kempton,  B.;  Depree,  D.  J. 

The  New  Mexico  State  University  Distributed  UNIX  System:  Evaluation  and 
Extension. 

In  Proceedings  of  the  Sixteenth  Hawaii  International  Conference  on  System 
Sciences  (Honolulu  HI),  pages  225-233.  University  of  Hawaii,  University  of 
Southwestern  Louisiana,  January  5-7, 1983. 

Abstract 

Through  a  joint  effort  between  New  Mexico  State  University  and  the  Hebrew  University  of  Jerusalem,  a  distributed 
version  of  the  UNIX  operating  system  is  currently  being  developed.  A  microprocessor  version  of  the  UNIX  kernel 
has been designed and implemented to run on any member of the PDP-11/LSI-11 family of processors and allows
programs  to  run  in  a  'UNIX-like'  environment.  As  the  kernels  running  in  the  distributed  processing  elements 
present  a  'UNIX-like'  environment,  all  processes  in  the  system  are  fully  transportable  from  one  processor  to 
another.  While  the  original  version  of  the  system  was  built  in  a  star  configuration,  the  system  is  currently  being 
enhanced through the addition of a communication ring which uses 8-bit microprocessors as ring interface units.
The paper describes the software and hardware
structure  of  the  system  as  well  as  some  performance  measurements  taken  on  the  basic  star  version  of  the 
implementation. 

[Kartashev  82]  Kartashev,  S.  I.;  Kartashev,  S.  P. 

A  Distributed  Operating  System  for  a  Powerful  System  with  Dynamic  Architecture. 

In  AFIPS  Conference  Proceedings,  Vol  51,  1982  National  Computer  Conference 
(Houston  TX),  pages  103-116.  AFIPS,  June  7-10, 1982. 

Abstract 

The  paper  discusses  the  organization  of  a  distributed  operating  system  for  dynamic  architecture.  It  is  shown  that 
the operating system must feature two types of distribution: (A) functional or vertical, whereby it is distributed
among functional units in accordance with the types of conflicts that should be resolved; and (B) modular or
horizontal, whereby it is distributed among modules performing the same functions. In a dynamic architecture
there are three types of conflicts: memory, reconfiguration, and I/O. This leads to the division of OS into three
subsystems: (1) a processor OS that resolves memory conflicts, (2) a monitor OS that resolves reconfiguration
conflicts, and (3) an I/O OS that resolves all types of I/O conflicts. The paper presents a detailed organization for
the  processor  operating  system. 


[Kieburtz 81] Kieburtz, R. B.

A  Distributed  Operating  System  for  the  Stony  Brook  Multicomputer. 

In  Second  International  Conference  on  Distributed  Computing  Systems  (Paris, 
France),  pages  67-79.  Inst.  Nat.  Recherche  and  Inf.  Autom.;  Lab.  Recherche 
and  Inf.;  Paris-Sud  University  of  Orsay,  April  8-10, 1981. 

Abstract 

The  Stony  Brook  multicomputer  is  a  hierarchically  organized  network  of  computer  nodes  that  has  been  designed 
to  support  problem-solving  by  decomposition.  High  performance,  relative  to  the  speed  of  its  individual  processors, 
is one of its primary design goals. This paper describes the design of a message-based, distributed, operating
system nucleus for the network.

[Lacoss  80]  Lacoss,  Richard  T. 

Distributed  Sensor  Networks. 

Technical  Report  ESD-TR-80-244,  Electronic  Systems  Division,  Hanscom  AFB,  MA, 
September,  1980. 

Abstract 

This  Semiannual  Technical  Summary  reports  work  in  the  Distributed  Sensor  Networks  program  for  the  period  1 
April  through  30  September  1980.  Progress  related  to  development  and  deployment  of  test-bed  hardware  and 
software,  including  deployment  of  three  test-bed  nodes,  is  described.  A  complete  algorithm  chain  from  raw  data  to 
aircraft  locations,  employing  two  acoustic  arrays,  has  been  developed  and  demonstrated  experimentally  using 
data  collected  from  test-bed  nodes.  A  strawman  design  for  a  new  multiple  microprocessor  test-bed  node  computer 
is  presented.  Also  described  is  progress  in  the  design  and  development  of  a  real-time  network  kernel  for  the  DSN 
test  bed  in  general,  and  the  new  processor  in  particular. 

[Lantz  82]  Lantz,  K.  A.;  Gradischnig,  K.  D.;  Feldman,  J.  A.;  Rashid,  R.  F. 

Rochester’s  Intelligent  Gateway. 

IEEE  Computer  15(10):54-68,  October,  1982. 

Abstract 

The University of Rochester has had several years' experience in the design and implementation of a
multiple-machine, multiple-network distributed system called RIG, or Rochester's Intelligent Gateway. RIG was designed as
a  state-of-the-art  research  computing  environment  to  support  a  variety  of  distributed  applications  and  research  in 
distributed  computing.  Particular  applications  include  computer  image  analysis  and  design  automation  for  VLSI. 
Distributed  systems  research  includes  investigations  into  internetwork  architectures,  interprocess  communication, 
naming,  distributed  file  systems,  distributed  control,  performance  monitoring,  exception  handling,  debugging,  and 
user interfaces.

[Lazowska 81] Lazowska, E. D.; Levy, H. M.; Almes, G. T.; Fischer, M. J.; Fowler, R. J.; Vestal, S. C.

The Architecture of the Eden System.

Operating  Systems  Review  15(5):148-159,  December,  1981. 

Abstract 

The  University  of  Washington's  EDEN  project  is  a  five-year  research  effort  to  design,  build  and  use  an  integrated 
distributed  computing  environment.  The  underlying  philosophy  of  Eden  involves  a  fresh  approach  to  the  tension 
between  these  two  adjectives.  In  briefest  form,  Eden  attempts  to  support  both  good  personal  computing  and  good 
multi-user  integration  by  combining  a  node  machine/local  network  hardware  base  with  a  software  environment 
that encourages a high degree of sharing and cooperation among its users. The hardware architecture of Eden
involves an Ethernet local area network interconnecting a number of node machines with bit-map displays, based
upon the Intel iAPX 432 processor. The software architecture is object-based, allowing each user access to the
information and resources of the entire system through a simple interface. This paper states the philosophy and
goals  of  Eden,  describes  the  programming  methodology  that  has  been  chosen  to  support,  and  discusses  the 
hardware  and  kernel  architecture  of  the  system. 

[LeLann 81] LeLann, G.

A  Distributed  System  for  Real-Time  Transaction  Processing. 

IEEE Computer 14(2):43-48, February, 1981.

Abstract 

The  computing  systems  considered  in  this  article  are  built  from  a  variety  of  commonly  available  hardware 
components for processing, storage, and communication, such as minicomputers, disks, and buses. Physically
distributed over short distances, these systems are usually labeled as multiple-processor computers or local area
computer networks. We begin by outlining the basic problems that were addressed during the design of Delta, an
experimental distributed transactional system built within the framework of Project Sirius. We then discuss some of
the  advantages  of  distributed  architectures  and  conclude  with  a  presentation  of  the  basic  aspects  of  Delta's 
distributed  executive  mechanisms. 

[Liu  82]  Ming  T.  Liu;  Duen-Ping  Tsay;  Lian,  R.  C. 

Design of a Network Operating System for the Distributed Double-Loop Computer
Network  (DDLCN). 

In Local Computer Networks. Proceedings of the IFIP TC 6 International In-Depth
Symposium on Local Computer Networks (Florence, Italy), pages 225-248. IFIP,
April  19-21,  1982. 

Abstract 

Presents  the  framework  and  model  of  a  Network  Operating  System  (NOS)  for  use  in  distributed  systems  in  general 
and  for  use  in  the  distributed  double-loop  computer  network  (DDLCN)  in  particular.  An  integrated  approach  is 
taken to design the NOS model and protocol structure. It is based on the object model and a novel 'task' concept,
using  message  passing  as  an  underlying  semantic  structure.  A  layered  protocol  is  provided  for  the  distributed 
system  kernel  to  support  NOS.  This  approach  provides  a  flexible  organization  in  which  system-transparent 
resource  sharing  and  distributed  computing  can  evolve  in  a  modular  fashion. 

[Luderer  81]  Luderer,  G.  W.  R.;  Che,  H.;  Haggerty,  J.  P.;  Kirslis,  P.  A.;  Marshall,  W.  T. 

A  Distributed  UNIX  System  Based  on  a  Virtual  Circuit  Switch. 

Operating Systems Review 15(5):160-168, December, 1981.

Abstract 

The  popular  UNIX  operating  system  provides  time-sharing  service  on  a  single  computer.  This  paper  reports  on  the 
design  and  implementation  of  a  distributed  UNIX  system.  The  new  operating  system  consists  of  two  components: 
The S-UNIX subsystem provides a complete UNIX process environment enhanced by access to remote files; the
F-UNIX subsystem is specialized to offer remote file service. A system can be configured out of many computers
which operate either under the S-UNIX or the F-UNIX operating subsystems. Computers communicate with each
other  through  a  high-bandwidth  virtual  circuit  switch.  Small  front-end  processors  handle  the  data  and  control 
protocol  for  error  and  flow-controlled  virtual  circuits.  Terminals  may  be  connected  directly  to  the  computers  or 
through  the  switch.  Operational  since  early  1980,  the  system  has  served  as  a  vehicle  to  explore  virtual  circuit 
switching as the basis for distributed system design. The performance of the communication software has been a
focus  of  the  work.  Performance  measurement  results  are  presented  for  user  process  level  and  operating  system 
driver  level  data  transfer  rates,  message  exchange  times,  and  system  capacity  benchmarks.  The  architecture 
offers reliability and modularly growable configurations. The communication service offered can serve as a
foundation  for  different  distributed  architectures. 

[Lycklama  78]  Lycklama,  H.;  Bayer,  D.  L. 

The  MERT  Operating  System. 

The Bell System Technical Journal 57(6):2049-2086, July-August, 1978.

Abstract 

The  MERT  operating  system  supports  multiple  operating  system  environments.  Messages  provide  the  major  means 
of  inter-process  communication.  Shared  memory  is  used  where  tighter  coupling  between  processes  is  desired. 
The file system was designed with real-time response being a major concern. The system has been implemented
on the DEC PDP-11/45 and PDP-11/70 computers and supports the UNIX time sharing system, as well as some
real-time  processes.  To  provide  an  environment  favorable  to  applications  with  real-time  response  requirements, 
the  MERT  system  permits  processes  to  control  scheduling  parameters.  These  include  scheduling  priority  and 
memory  residency.  A  rich  set  of  inter-process  communication  mechanisms  including  messages,  events  (software 
interrupts), shared memory, inter-process traps, process ports, and files, allows applications to be implemented as
several independent, cooperating processes. Some uses of the MERT operating system are discussed. A
retrospective  view  of  the  MERT  system  is  also  offered.  This  includes  a  critical  evaluation  of  some  of  the  design 
decisions  and  a  discussion  of  design  improvements  which  could  have  been  made  to  improve  overall  efficiency. 

[Mahjoub  82]  Mahjoub,  A. 

A  Distributed  Operating  System  for  a  Local  Area  Network. 

In Ninth Australian Computer Conference Vol. 2 (Hobart, Tasmania, Australia), pages 633-647. August 23-27, 1982.

Abstract

The  design  and  implementation  of  an  experimental  distributed  operating  system  for  a  local  area  network  are 
discussed.  The  salient  feature  of  this  operating  system  is  that  it  achieves  complete  machine  transparency  and 
atomicity  of  remote  operations.  The  system,  as  a  whole,  provides  a  suitable  environment  for  a  distributed  version  of 
the  concurrent  programming  language  MODULA  without  introducing  any  modification  to  its  compiler. 

[Maisonneuve  81]  Maisonneuve,  M.;  Levy,  J.  P.;  Konrat,  J.  L. 

E10.S  Operating  System  for  a  Distributed  Architecture. 

In IEE Fourth International Conference on Software Engineering for
Telecommunication Switching Systems (Coventry, England), pages 124-129.
IEE, July 20-24, 1981.

Abstract 

Describes the general structure of computer-controlled telephone exchanges and rather briefly discusses E10.S
hardware, before entering into the details of the system's software, the main subject of this paper.

[Mamrak  83]  Mamrak,  S.  A.;  Leinbaugh,  D.;  Berk,  T.  S. 

A  Progress  Report  on  the  Desperanto  Research  Project:  Software  Support  for 
Distributed  Processing. 

Operating Systems Review 17(1):17-29, January, 1983.

Abstract 

The DESPERANTO research project has been investigating topics in the area of distributed computing systems
since  the  fall  of  1980.  The  project  addresses  problems  that  arise  in  the  design  and  implementation  of  software 
support for general-purpose resource sharing in networks consisting of heterogeneous nodes. Although it is still
premature to publish the details of the solutions to the design problems in journal (or archival) form, this report has
been  prepared  to  describe  design  issues  and  progress  made  to  date. 

[Manning 77]  Manning, E.; Peebles, R. W.

A Homogeneous Network for Data-Sharing Communications.

Computer Networks 1(4):211-224, June, 1977.

[McCarthy  81]  McCarthy,  J.  L.;  Merrill,  D.  W.;  Marcus,  A.;  Benson,  W.  H.;  Gey,  F.  C. 

SEEDIS  Project:  A  Summary  Overview. 

Technical  Report  PUB-424,  Department  of  Energy,  Washington,  DC  (UC-Berkeley), 
September,  1981. 

Abstract 

The  SEEDIS  project  includes:  a  research  program  to  investigate  information  systems  spanning  diverse  data 
sources,  computer  hardware  and  operating  systems;  a  testbed  distributed  information  system  running  on  a 
network of Digital Equipment Corporation (DEC) VAX computers, which is used for selected applications as well as
research  and  development;  a  set  of  interactive  information  management  and  analysis  tools  in  fields  such  as  energy 
and  resource  planning,  employment  and  training  program  management,  and  environmental  epidemiology;  and  a 
major  collection  of  databases  for  various  geographic  levels  and  time  periods  drawn  from  the  US  Census  Bureau 
and  other  sources. 

[McDonald  82]  McDonald,  W.  C.;  Smith,  R.  W. 

A  Flexible  Distributed  Testbed  for  Real-Time  Applications. 

IEEE  Computer  15(10):25-38,  October,  1982. 

Abstract 

This  article  describes  a  flexible  distributed  testbed  that  is  being  developed  to  support  the  development,  analysis, 
test,  evaluation,  and  validation  of  research  in  distributed  computing  for  real-time  applications.  The  testbed  not  only 
provides the resources for experimentally obtaining quantitative results, but also serves as a focal point for the
research,  integrating  related  research  activities  and  providing  a  mechanism  for  technology  transfer  to  associated 
research  efforts. 

[McKendry  83]  McKendry,  M.  S.;  Allchin,  J.  E.;  Thibault,  W.  C. 

Architecture  for  a  Global  Operating  System. 

In  Proceedings  of  IEEE  INFOCOM  83  (San  Diego,  CA),  pages  25-30.  IEEE,  April 
18-21, 1983. 

Abstract

Global  operating  systems  are  suited  to  distributed,  local-area  network  environments.  A  decentralized  global 
operating  system  can  manage  all  resources  globally,  relying  on  functional  requirements  for  resource  allocation, 
rather  than  on  the  relative  physical  locations  of  the  resource  allocation  mechanism  and  the  resource  itself.  Among 
the  advantages  of  global  operating  systems  are  the  ability  to  use  idle  resources  and  to  control  the  environment  as 
a  single  cohesive  entity.  This  paper  introduces  an  architectural  approach  to  supporting  decentralized  global 
operating  systems.  The  approach  addresses  the  problem  of  managing  distributed  data  by  incorporating 
specialised  data  management  facilities  in  the  kernel.  This  data  management  is  especially  useful  to  the  operating 
system  itself.  A  capability-based  access  scheme  provides  flexible  control  of  resources  and  autonomy.  The 
approach  is  being  utilised  in  the  Clouds  operating  system  project  at  Georgia  Institute  of  Technology. 

[Measures  82]  Measures,  M.;  Carr,  P.  A.;  Shriver,  B.  D. 

A  Distributed  Operating  System  Kernel  Based  on  Dataflow  Principles. 

In  Proceedings  of  Computer  Networks  COMPCON  82.  Twenty-fifth  IEEE  Computer 
Society  International  Conference  (Washington  DC),  pages  106-115.  IEEE, 
September  20-23,  1982. 

Abstract 

The  design  of  the  Distributed  Operating  System  Kernel,  or  DOSK,  is  presented  as  an  operating  system  for  a 
distributed  computing  system.  An  extended  dataflow  model  forms  the  basis  for  both  the  programs  DOSK  executes 
and  the  implementation  of  DOSK  itself.  DOSK  can  realize  the  parallelism  in  a  program  by  distributing  portions  of 
the  program  across  the  system  for  concurrent  execution.  DOSK  consists  of  several  asynchronous  processes  that 
communicate  via  message-passing  using  a  dataflow  protocol. 

[Miller  81]  Miller,  B.;  Presotto,  D. 

XOS:  An  Operating  System  for  the  X-tree  Architecture. 

Operating  Systems  Review  15(2):21-32,  April,  1981. 

Abstract 

Describes  the  fundamentals  of  the  X-tree  Operating  System  (XOS),  a  system  developed  to  investigate  the  effects  of 
the  X-tree  architecture  on  operating  system  design.  It  outlines  the  goals  and  constraints  of  the  project  and 
describes the major features and modules of XOS. Two concepts are of special interest: the first is demand paging
across the network of nodes and the second is separation of the global object space and the directory structure
used to reference it. Weaknesses in the model are discussed along with directions for future research.

[Miller  83]  Miller,  D.  S.;  Fisher,  R.  W.;  Millard,  B.  R.;  Murthy,  V.  G. 

A  Distributed  Operating  System  for  a  Local  Area  Network. 

In  Second  Annual  Phoenix  Conference  on  Computers  and  Communications.  1983 
Conference  Proceedings  (Phoenix  AZ),  pages  281-288.  IEEE,  March  14-16, 
1983. 

Abstract 

HERBERT-II is a distributed operating system which runs on a local area network of three 6809-based Codex
intelligent terminal system computers fully connected by MC6821 PIA parallel interfaces. The Codex ISOS
operating  system  at  each  node  has  been  extended  to  include  physical,  link,  network,  transport  and  session 
communication  layers  normally  added  on  as  an  afterthought  in  access  methods  or  utilities  in  conventional 
distributed system architectures. HERBERT-II is an object-oriented UNIX-like operating system which supports
multiprogramming  on  multiple  processors. 

[Muntz  83]  Muntz,  Charles  A. 

NSW  (National  Software  Works)  Executive  Enhancements  II. 

Technical Report RADC-TR-83-59, Rome Air Development Center, Griffiss AFB,
NY, March, 1983.

Abstract 

The National Software Works (NSW) represents a significant evolutionary step in the fields of distributed processing
and network operating systems. Its ambitious goal has been to link the resources of a set of geographically
distributed  and  heterogeneous  hosts  with  an  operating  system  which  would  appear  as  a  single  entity  to  a  user.  It  is 
principally  aimed  at  the  development  of  software  systems  and  at  providing  software  tools  which  can  be  used  to 
support  the  software  development  activity  throughout  its  life  cycle.  This  report  describes  the  current  status  of  the 
NSW system as well as highlights the enhancements and improvements made to the NSW system during the past
two  years. 

[Ousterhout 80]  Ousterhout, J. K.; Scelza, D. A.; Sindhu, P. S.

Medusa:  An  Experiment  in  Distributed  Operating  System  Structure. 

Communications  of  the  ACM  23(2):92-105,  February,  1980. 

Abstract

The design of Medusa, a distributed operating system for the Cm* multimicroprocessor, is discussed. The Cm*
architecture combines distribution and sharing in a way that strongly impacts the organization of operating
systems. Medusa is an attempt to capitalize on the architectural features to produce a system that is modular,
robust, and efficient. To provide modularity and to make effective use of the distributed hardware, the operating
system is partitioned into several disjoint utilities that communicate with each other via messages. To take
advantage of the parallelism present in Cm* and to provide robustness, all programs, including the utilities, are
task  forces  containing  many  concurrent,  cooperating  activities. 

[Popek  81]  Popek,  G.;  Walker,  B.;  Chow,  J.;  Edwards,  D.;  Kline,  C.;  Rudisin,  G.;  Thiel,  G. 

LOCUS:  A  Network  Transparent,  High  Reliability  Distributed  System. 

Operating Systems Review 15(5):169-177, December, 1981.

Abstract 

LOCUS  is  a  distributed  operating  system  that  provides  a  very  high  degree  of  network  transparency  while  at  the 
same  time  supporting  high  performance  and  automatic  replication  of  storage.  By  network  transparency  the 
authors  mean  that  at  the  system  call  interface  there  is  no  need  to  mention  anything  network  related.  Knowledge  of 
the  network  and  code  to  interact  with  foreign  sites  is  below  this  interface  and  is  thus  hidden  from  both  users  and 
programs  under  normal  conditions.  LOCUS  is  application  code  compatible  with  UNIX,  and  performance  compares 
favorably  with  standard,  single  system  UNIX.  LOCUS  runs  on  a  high  bandwidth,  low  delay  local  network.  It  is 
designed  to  permit  both  a  significant  degree  of  local  autonomy  for  each  site  in  the  network  while  still  providing  a 
network-wide,  location  independent  name  structure.  Atomic  file  operations  and  extensive  synchronization  are 
supported.  Small,  slow  sites  without  local  mass  store  can  coexist  in  the  same  network  with  much  larger  and  more 
powerful  machines  without  larger  machines  being  slowed  down  through  forced  interaction  with  slower  ones. 
Graceful  operation  during  network  topology  changes  is  supported. 

[Rapantzikos  81]  Rapantzikos,  Demosthenis  K. 

Detailed  Design  and  Implementation  of  the  Kernel  of  a  Real-Time  Distributed 
Multiprocessor  Operating  System. 

Master's  thesis,  Naval  Postgraduate  School,  Monterey,  CA,  March,  1981. 

Abstract 

This  thesis  presents  the  detailed  design  and  implementation  of  the  kernel  of  a  real-time,  distributed  operating 
system  for  a  microcomputer  based  multiprocessor  system.  Process  oriented  structure,  segmented  address  spaces 
and  a  synchronization  mechanism  based  on  event  counts  and  sequencers  comprise  the  central  concepts  around 
which  this  operating  system  is  built.  The  operating  system  is  hierarchically  structured,  layered  in  three  loop  free 
levels  of  abstraction  and  fundamentally  configuration  independent.  This  design  permits  the  logical  distribution  of 
the  kernel  functions  in  the  address  space  of  each  process  and  the  physical  distribution  of  system  code  and  data 
among the microcomputers. This physical distribution, in turn, in a multimicroprocessor configuration, will help to
minimize  system  bus  contention.  The  system  particularly  supports  applications  where  processing  for  which  this 
system has been specifically developed. The implementation was developed for the INTEL 86/12A single-board
computer  using  the  8086  processor  chip. 

[Rashid  81]  Rashid,  R.  F.;  Robertson,  G.  G. 

Accent:  A  Communication  Oriented  Network  Operating  System  Kernel. 

Operating  Systems  Review  15(5):64-75,  December,  1981. 

Abstract 

Accent is a communication oriented operating system kernel being built at Carnegie-Mellon University to support
the  distributed  personal  computing  project,  SPICE,  and  the  development  of  a  fault-tolerant  Distributed  Sensor 
Network  (DSN).  Accent  is  built  around  a  single,  powerful  abstraction  of  communication  between  processes,  with 
all  kernel  functions,  such  as  device  access  and  virtual  memory  management  accessible  through  messages  and 
distributable  throughout  a  network.  In  this  paper,  specific  attention  is  given  to  system  supplied  facilities  which 
support  transparent  network  access  and  fault-tolerant  behavior.  Many  of  these  facilities  are  already  being  provided 
under  a  modified  version  of  VAX/UNIX.  The  Accent  system  itself  is  currently  being  implemented  on  the  Three 
Rivers  Corp.  PERQ. 

[Reiter 81]  Reiter, E. E.; Zimmerman, D. L.

Distributed  Operating  System  for  Cooperating  Functional  Processors. 

Technical  Report  UCID-19847,  Lawrence  Livermore  National  Laboratory,  CA,  1981. 

Abstract 

The  paper  has  been  divided  into  several  main  chapters.  This  first  chapter  contains  a  discussion  of  the  goals  of  the 
system,  the  architecture  assumptions  used,  and  the  structure  of  FP.  The  next  chapters  present  an  overview  and 
discussion  of  the  service  support  layer  processes,  the  other  upper  level  supports  for  the  partition  of  problems, 
parallel  processing,  and  the  implementation  of  the  FP  language.  These  chapters  are  the  heart  of  the  paper,  in  the 
sense  that  they  deal  with  the  possibility  of  an  implementation  of  a  parallel  processor  that  accepts  FP.  The  next 
chapter  is  a  discussion  of  the  kernel  of  the  operating  system.  In  this  distributed  system,  this  is  the  message  passing 
system.  It  allows  processes  on  the  same  or  different  nodes  to  communicate,  and  is  thus  the  backbone  of  the  entire 
system.  Finally,  we  have  included  another  chapter  which  reviews  some  of  the  issues  covered  in  this  system.  For 
instance,  we  included  discussions  of  synchronization,  resource  allocation,  and  protection. 

[Restorick  82]  Restorick,  F.  M.;  Pardoe,  B.  H. 

A  Multi-Microprocessor  Design  for  Use  in  a  Packet  Switched  Network. 

In  Pathways  to  the  Information  Society.  Proceedings  of  the  Sixth  International 
Conference on Computer Communication (London, England), pages 811-816.
International  Council  of  Computer  Communication,  September  7-10, 1982. 

Abstract 

Describes a multi-processor architecture which is particularly suited to act as a node processor in a packet
switching environment. The basic concept is that each high speed link entering the node has its own dedicated
module, containing its own packet memory, CPU, and operating software. There are two global busses which act
as an interconnect between the separate modules. These are the system bus, and an inter CPU bus. Due to the
'loose' coupling between each processor module, the possibility of failure of the whole node is reduced. The basic
kernel of the distributed operating system needed to run this multi-processor as a packet switching node is
discussed. The recovery mechanisms with regard to a link module failure are also dealt with.

[Rieger  79]  Rieger,  Chuck. 

ZMOB: A Mob of 256 Cooperative Z80A-Based Microcomputers.

Technical  Report  TR-825,  Department  of  Computer  Science,  Maryland  University, 
College  Park,  MD,  November,  1979. 

Abstract 

Current  directions  of  computer  science  and  computing  in  general  are  toward  more  parallel  machine  architectures 
and  distributed  models  of  computing  based  upon  these  new  architectures.  Recently,  there  has  been  considerable 
interest  in  highly  parallel  architectures  capable  of  supporting  complex  distributed  computation  via  a  large  number 
of  autonomous  processors.  ZMOB  is  such  a  machine,  currently  under  design  and  simulation.  Architecturally, 
ZMOB is a collection of 256 identical but autonomous Z80A-based microcomputers (processors). Each processor
comprises 32K bytes of 375 ns read/write central memory (expandable to 48K bytes), up to 4K bytes of resident
operating system on 450 ns EPROM, an 8-bit hardware multiplier, and interface logic for communications
functions. 

[Rieger  81]  Rieger,  C. 

ZMOB:  Doing  it  in  Parallel! 

In 1981 IEEE Computer Society Workshop on Computer Architecture for Pattern
Analysis  and  Image  Database  Management  (Hot  Springs  VA),  pages  133-140. 
IEEE,  November  11-13, 1981. 

Abstract 

The  architecture  and  applications  of  ZMOB,  a  256  processor  computer  for  artificial  intelligence  and  general 
computer science research, are described. This machine's 16 million byte distributed memory, 100 million
instruction  per  second  overall  throughput,  and  high  speed  interprocessor  communication  make  ZMOB  attractive 
and  appropriate  for  a  wide  range  of  basic  and  applied  research  in  parallel  computing.  ZMOB's  price  tag  is 
approximately  $150K,  and  the  machine  will  be  operational  by  late  1981. 


[Rivoira 82]  Rivoira, S.; Serra, A.

A  Multimicro  Architecture  and  its  Distributed  Operating  System  for  Real  Time 
Control. 

In  Proceedings  of  the  3rd  International  Conference  on  Distributed  Computing 
Systems (Miami/Fort Lauderdale FL), pages 238-246. IEEE, October 18-22,
1982.

Abstract 

In  a  tightly  coupled  multi-microcomputer  system  suitable  for  process  control  applications,  the  microcomputers  are 
grouped  into  a  cluster  and  communicate  using  a  high  speed  parallel  common  bus.  Hardware  mechanisms  are 
provided  as  supports  for  the  implementation  of  synchronization  primitives  between  processes  allocated  on 
different processors. The system fault tolerance is achieved by memory management units, which relocate and
protect programs and data against faults and programming mistakes. The distributed operating system kernel
makes available a virtual machine where processes allocated on different processors are executed in parallel, and
processes which reside on the same processor are executed in a multitasking environment.

[Schmidtke  82]  Schmidtke,  F.  E. 

A  Communication  Oriented  Operating  System  Kernel  for  a  Fully  Distributed 
Architecture. 

In Pathways to the Information Society. Proceedings of the Sixth International
Conference on Computer Communication (London, England), pages 757-762.
International Council of Computer Communication, September 7-10, 1982.

Abstract 

Starting with a description of the network architecture of the loosely coupled multimicrocomputer
system SIELOCNET, the basic design principles of the approach are outlined. The currently implemented network
operating  system  called  OINOS  is  based  on  autonomous  system  software  for  all  computer  nodes  which  cooperate 
with  other  components  by  well  defined  protocols.  It  is  based  on  a  state-of-the-art  realtime-multitasking  kernel 
managing  the  local  activities  of  a  single  node.  The  OINOS  communication  mechanism  across  computer 
boundaries  as  well  as  the  overall  load  balancing  and  allocation  management  are  embedded  within  a  layered 
structure of each local operating system. For a programmer there is a unique addressing scheme for local objects
within  a  single  computer  and  remote  ones  residing  elsewhere. 

[Schmidtke  83]  Schmidtke,  F.  E. 

Operating System for an Optical-Bus Local Network.

Siemens Forsch.- und Entwicklungsber. (Germany) 12(1):16-20, January, 1983.

Abstract 

The  report  introduces  the  network  architecture  of  SIELOCNET  and  its  functional  decomposition  into  workstations, 
dedicated  computers  and  arbitrary  processing  nodes.  The  basic  design  goals  and  characteristics  of  OINOS,  a 
Distributed  Network  Operating  System,  are  outlined.  It  is  designed  and  implemented  as  a  hierarchically  layered 
system  providing  a  separation  of  mechanisms  and  strategies  and  offering  a  completely  transparent  interface  to 
individual applications. OINOS consists of a collection of autonomous but cooperative local node operating
systems,  each  of  which  is  a  collection  of  partly  replicated,  partly  specific  software  modules  bound  together  in  a 
system  generation  procedure.  Together  they  define  the  functional  capabilities  of  a  node. 

[Sedillot  80]  S.  Sedillot  and  G.  Sergeant. 

The  Consistency  and  Execution  Control  Systems  for  a  Distributed  Data  Base  in 
SIRIUS-DELTA.

Paper  proposed  to  IFIP  80  Congress. 

[Sergeant  79]  G.  Sergeant  and  L.  Treille. 

SER:  A  System  for  Distributed  Execution  Based  on  Decentralized  Control 
Techniques. 

Paper  proposed  to  IFIP  80  Congress. 


[Solomon 79]  M.H. Solomon and R.A. Finkel.

The  Roscoe  Distributed  Operating  System. 

In Proceedings 7th ACM Symposium on Operating Systems Principles, pages
108-114.  ACM,  December,  1979. 

[Springer  82]  Springer,  J.  F. 

The  Architecture  of  a  Multi-computer  Signal  Processing  System. 

In  Proceedings  of  the  Real-Time  Systems  Symposium  (Los  Angeles,  CA),  IEEE, 
December  7-9, 1982. 

Abstract 

This  paper  describes  the  architecture  of  a  recently  developed  multi-microcomputer  signal  processor.  The  purpose 
of  this  development  is  to  provide  a  flexible  system  capable  of  ready  application  to  a  variety  of  signal  processing 
problems using a combination of special purpose and off-the-shelf single board computers. The system is
supported  by  an  equally  flexible  distributed  software  system  comprising  operating  systems  and  application 
components. 

[Tanenbaum  81]  Tanenbaum,  A.  S.;  Mullender,  S.  J. 

An  Overview  of  the  Amoeba  Distributed  Operating  System. 

Operating  Systems  Review  15(3):51-64,  July,  1981. 

Abstract 

Describes  the  design  of  a  distributed  operating  system,  AMOEBA,  intended  to  control  a  collection  of  machines 
based on the pool-of-processors idea.

[Tokuda 83]  Hideyuki Tokuda, Sanjay R. Radia and Eric G. Manning.

Shoshin OS: a Message-based Operating System for a Distributed Software
Testbed. 

In Proceedings of the Sixteenth Hawaii International Conference on System
Sciences, 1983 (Honolulu HI), pages 329-338. University of Hawaii, University of
Southwestern Louisiana, January 5-7, 1983.

Abstract 

A  distributed  software  testbed,  called  SHOSHIN,  has  been  constructed  to  study  the  development  and  evaluation  of 
distributed  software.  The  SHOSHIN  system  consists  of  two  PDP  11/45’s  and  ten  LSI  11/23’s  connected  by  a 
tailor-made high-speed, parallel bus, called the SCHOOLBUS. The SHOSHIN OS runs on each LSI 11/23
processor, to provide a distributed program environment. This paper describes the software architecture of the
SHOSHIN  OS,  focusing  on  network  transparent  process  management  and  interprocess  communication. 

[Trigg  81]  Trigg,  R. 

Software  on  ZMOB:  An  Object-Oriented  Approach. 

In  1981  IEEE  Computer  Society  Workshop  on  Computer  Architecture  for  Pattern 
Analysis  and  Image  Database  Management  (Hot  Springs  VA),  pages  133-140. 
IEEE,  November  11-13, 1981. 

Abstract 

This  paper  discusses  the  future  of  software  on  ZMOB  with  particular  attention  paid  to  the  object-oriented 
programming  style.  Included  is  a  look  at  the  current  languages  supported  by  ZMOB  as  well  as  future  possibilities. 
The  suitability  of  the  object-oriented  style  for  ZMOB  is  discussed  and  various  application  areas  are  briefly 
described  including  the  domain  of  mechanism  simulation.  Finally  some  ramifications  of  object-oriented 
programming  to  graphics  applications  are  pointed  out. 

[Tsay 81]  Duen-Ping Tsay; Liu, M. T.

MIKE:  A  Network  Operating  System  for  the  Distributed  Double-Loop  Computer 
Network  (DDLCN). 

In Proceedings of COMPSAC 81. IEEE Computer Society's Fifth International
Computer Software and Applications Conference (Chicago, IL), pages 388-402.
IEEE, November 16-20, 1981.

Abstract 

This paper presents the framework and model of a network operating system (NOS) called MIKE for use in
distributed systems in general and for use in the distributed double-loop computer network (DDLCN) in particular.
MIKE, which stands for Multicomputer Integrated KErnel, provides system-transparent operation for users and
maintains cooperative autonomy among local hosts. An integrated approach is taken to design the NOS model and
protocol structure. MIKE is based on the object model and a novel 'task' concept, using message passing as an
underlying  semantic  structure.  A  layered  protocol  is  provided  for  the  distributed  system  kernel  to  support  NOS. 
This  approach  provides  a  flexible  organization  in  which  system-transparent  resource  sharing  and  distributed 
computing  can  evolve  in  a  modular  fashion.  In  this  paper,  the  NOS  model  as  well  as  the  notion  of  'task'  are  first 
presented  and  the  system  naming  convention  is  then  examined.  A  two-level  process  interaction  model  is  next 
described.  The  protection  mechanism  is  then  discussed  emphasizing  maximal  error  confinement.  A  scenario  for 
system-transparent  resource  sharing  using  the  above  concepts  is  also  given.  Finally,  a  multilayer,  multidestination 
protocol  structure  is  detailed. 

[Tsuruho  82]  Tsuruho,  S.;  Murata,  N.;  Haihara,  M. 

Design  and  Implementation  of  DIPS  104-03  Operating  System  for  Distributed 
Processing. 

REVIEW of the Electrical Communication Laboratories 30(6):990-1000, November,
1982. 

Abstract 

Describes  the  DIPS  distributed  processing  system  design  and  implementation,  and  clarifies  the  software 
technology  in  realizing  the  system.  The  distributed  processing  technology  is  discussed  for  two  cases:  the  large 
scale  distributed  system  and  load  distributed  system,  as  follows:  (1)  How  to  share  the  functions  between 
communication  processing  and  information  processing.  (2)  How  to  retain  the  distribution  transparency  for 
application  program.  (3)  How  to  control  interprocessor  communication.  (4)  How  to  manage  the  files  shared  among 
processes. 

[Van  Den  Eijnden  82] 

Van  Den  Eijnden,  P.  M.  C.  M.;  Dortmans,  H.  M.  J.  M.;  Kemper,  J.  P.;  Stevens,  M.  P.  J. 
Jobhandling  in  a  Network  of  Distributed  Processors. 

Technical  Report  EUT-82-E-131,  Eindhoven  Univ.  Technol.,  Netherlands,  October, 
1982. 

Abstract 

Describes  the  development  of  a  completely  distributed  modular  computer  system.  The  system  is  composed  of 
processing  units,  which  can  perform  specified  tasks  independently.  Adding  intelligence  to  peripheral  devices,  by 
means  of  microprocessors  and  buffer  memories,  provides  for  independent  functioning  of  these  peripherals.  An 
intensive  transport  between  the  devices  is  required.  The  devices  are  therefore  connected  by  means  of  a 
nonblocking  communication  network,  to  gain  full  profit  of  their  intelligence.  The  intelligent  devices  are  also 
connected  to  a  central  facility  containing  the  operating  system.  The  operating  system  is  distributed  over  a  number 
of  cooperating  modules.  Each  operating  system  module  supports  one  intelligent  device.  The  operating  system 
modules  control  the  load  among  the  devices  and  see  to  the  correct  processing  of  jobs,  presented  by  a  user.  Each 
is  equipped  with  its  own  buffer  capacities  and  processing  power.  The  network  that  interconnects  the  operating 
system  modules  has  the  same  structure  as  that  linking  the  devices.  The  operating  system  modules  are  relatively 
simple,  because  each  intelligent  device  has  the  same  characteristics,  seen  from  the  operating  system  point  of 
view. 

[Van  Der  Linden  81] 

Van  Der  Linden,  R. 

A Multi-Processor System for Data Communication.

In  Implementing  Functions:  Microprocessors  and  Firmware.  Seventh  Euromicro 
Symposium  on  Microprocessing  and  Microprogramming  (Paris,  France),  pages 
117-123. September 8-10, 1981.

Abstract 

A research project into developing a system for data switching and handling data is described. The system is
based  on  microprocessors  supported  by  large  scale  integrated  peripherals,  communicating  with  each  other  over  a 
high  speed  bus.  The  internal  data  transmission  rate  is  10  megabytes.  The  software  of  the  system  is  based  on  the 
distributed  system  approach,  where  a  job  is  performed  by  several  processes.  The  multitask  operating  system  was 
especially  developed  for  handling  real-time  applications  and  for  solving  difficulties  relevant  to  the  data 
environment.


[Van Tilborg 81a] Van Tilborg, A. M.; Wittie, L. D.

Distributed  Task  Force  Scheduling  in  Multi-Microcomputer  Networks. 

In AFIPS Conference Proceedings (Chicago, IL), pages 283-289. AFIPS, May 4-7,
1981.

Abstract 

Efficient  task  scheduling  techniques  are  needed  for  microcomputer  networks  to  be  used  as  general  purpose 
computers. The wave scheduling technique, developed for the MICRONET network computer, co-schedules
groups  of  related  tasks  onto  available  network  nodes.  Scheduling  managers  are  distributed  over  a  logical  control 
hierarchy.  They  subdivide  requests  for  groups  of  free  worker  nodes  and  send  waves  of  requests  towards  the 
leaves  of  the  control  hierarchy,  where  all  workers  are  located.  Because  requests  from  different  managers  compete 
for workers, a manager may have to try a few times to schedule a task force. Each task force manager actually
requests slightly more workers than it really needs. It computes a request size which minimizes expected
scheduling overhead, as measured by total idle time in worker nodes. Using a Markov queueing model, it is shown
that wave scheduling in a network of microcomputers is almost as efficient as centralized scheduling.

[Van  Tilborg  81b]  Van  Tilborg,  A.  M.;  Wittie,  L.  D. 

Wave  Scheduling:  Distributed  Allocation  of  Task  Forces  in  Network  Computers. 

In  Second  International  Conference  on  Distributed  Computing  Systems  (Paris, 
France),  pages  337-347.  Inst.  Nat.  Recherche  and  Inf.  Autom.;  Lab.  Recherche 
and  Inf.;  Paris-Sud  University  of  Orsay,  April  8-10, 1981. 

Abstract 

The  new  wave  scheduling  technique  is  described  and  analyzed.  It  distributes  task  force  scheduling  by  recursively 
subdividing  and  issuing  wavefront-like  requests  to  worker  nodes  capable  of  executing  user  tasks.  The  technique  is 
not restricted to any particular network computer interconnection topology. It uses a hierarchical high-level
operating  system  control  structure  to  partition  competing  task  forces  among  nodes  in  any  network  structure.  A 
cost  model  shows  how  to  minimize  wasted  processing  capacity  by  using  perceived  network  load  to  vary  the  wave 
scheduling  technique. 

[Vosbury 82] Vosbury, N.; Bryant, C.

System  Software  for  Experiments  in  Distributed  Computing  on  a  Distributed 
Testbed. 

In  Proceedings  of  the  3rd  International  Conference  on  Distributed  Computing 
Systems (Miami/Fort Lauderdale, FL), pages 410-415. IEEE, October 18-22,
1982.

Abstract 

Describes the system software for supporting experiments in distributed computing on a crossbar-interconnected
multi-microprocessor  system  testbed.  This  software  includes  operating  system  services,  system  utilities,  and  a 
compiler  for  the  language  POL.  The  POL  compiler  includes  a  type  transfer  capability,  a  special  procedure  call,  and 
utilities  for  tasking  that  support  operating  system  work.  Operating  system  components  include  Nucleus  Monitor 
Services (NMS), the Kernel Operating System (KOS), and the Master Operating System (MOS). NMS provides the
most  basic  services  in  each  microcomputer.  KOS  executes  in  each  microcomputer  and  is  responsible  for 
managing  the  local  resources.  MOS  provides  global  management  for  the  crossbar  system  computing  resources 
and  an  interface  to  an  architecture  design  system  that  can  be  used  to  construct  experiments  on  existing  testbed 
hardware. 

[Wasano  81]  Wasano,  T.;  Kamio,  M.;  Amano,  K. 

Development  of  Executive  Program  in  DIPS  104-02  Operating  System. 

REVIEW  of  the  Electrical  Communication  Laboratories  29(5-6):368-394,  May-June, 
1981. 

Abstract 

Design  considerations  for  the  DIPS  104-02  operating  system  executive  program,  applied  to  the  large  scale  data 
communication  systems,  are  discussed  from  the  following  points  of  view:  software  layer  structure  and  the 
functions of each layer; virtualization and distributed processing techniques in the computer center and in the
network; and operationability and reliability. New methods to improve control performance for the high speed
KANJI printer and the CPU/memory resources scheduling are discussed.

[Wasson 80] Wasson, Warren James.

Detailed  Design  of  the  Kernel  of  a  Real-Time  Multiprocessor  Operating  System. 
Master's thesis, Naval Postgraduate School, Monterey, CA, June, 1980.

Abstract 

This  thesis  describes  the  detailed  design  of  a  distributed  operating  system  for  a  real-time,  microcomputer  based 
multiprocessor  system.  Process  structuring  and  segmented  address  spaces  comprise  the  central  concepts  around 
which  this  system  is  built.  The  system  particularly  supports  applications  where  processing  is  partitioned  into  a  set 
of multiple processes. One such area is that of digital signal processing for which this system has been specifically
developed. The operating system is hierarchically structured to logically distribute its functions in each process.
This  and  loop-free  properties  of  the  design  allow  for  the  physical  distribution  of  system  code  and  data  amongst  the 
microcomputers.  In  a  multiprocessor  configuration,  this  physical  distribution  minimizes  system  bus  contention  and 
lays  the  foundation  for  dynamic  reconfiguration. 

[Waumans  82]  Waumans,  B.  L.  A. 

Software  Aspects  of  the  Phidias  System. 

Philips Tech. Rev. (Netherlands) 40(8-9):262-268, August, 1982.

Abstract 

The PHIDIAS distributed communication system is built up from ‘PRIMES’ (PRocessors with individual MEmory),
which  exchange  messages  by  means  of  a  common  communication  network.  The  system  does  not  have  a  common 
memory.  PHIDIAS  executes  programs  that  themselves  have  a  distributed  character.  To  enable  programs  to  be 
written  that  are  independent  of  a  specific  architecture  or  a  particular  computer  system,  an  existing  programming 
language  was  extended  to  include  the  facility  for  building  up  programs  from  independent  processes  (called 
‘SOMAS’)  that  exchange  messages  with  one  another.  The  operating  system  of  PHIDIAS  comprises  a  global 
operating  system  and  a  number  of  local  operating  systems  for  the  different  PRIMES.  The  global  operating  system 
can  put  defective  PRIMES  out  of  action  in  the  event  of  errors  and  redistribute  the  programs  among  the  remaining 
PRIMES.  The  local  operating  systems  ensure  that  a  number  of  SOMAS  can  run  on  one  PRIME. 

[Waxman  80]  Waxman,  Robert;  Domitz,  Robert;  Goldberg,  Frederick. 

Communications Processor Operating System, Volume 8. Task 8,
System/Subsystem Specification.

Technical  Report  RADC-TR-80-187-VOL-8,  Plessey,  Fairfield,  NJ,  June,  1980. 

Abstract 

The  Communications  Processor  Operating  System  (CPOS)  effort  is  one  program  of  a  multiple  program  effort 
whose  purpose  is  the  development  of  a  Unified  Digital  Switch  (UDS)  for  strategic  communications.  This  switch  will 
have the capability to perform circuit, packet and store-and-forward message switching in an integrated
communication  complex.  The  Communications  Processor  System  (CPS)  will  control  the  switching  node  and  will  be 
supported  by  an  operating  system  called  the  Communications  Processor  Operating  System.  In  particular, 
multilevel  communications  security  conforming  to  DoD  requirements  represents  a  difficult  problem  for  the  CPOS 
and  requires  solutions  which  are  on  the  fringe  of  the  current  technology.  In  addition,  the  need  for  high  reliability  is 
a  cause  of  concern  because  of  the  inexact  science  of  software  technology.  These  concerns  have  resulted  in  heavy 
emphasis  being  given  to  Tasks  2,  3,  6  and  7.  A  specification  has  been  prepared  as  a  stand-alone  document 
suitable  for  the  next  stage  of  contractual  or  in-house  development  of  the  CPOS. 

[Wilcox 81] Wilcox, Dwight.

Computer  Hardware  Executive:  Concept  and  Hardware  Design. 

Technical Report NOSC/TR-721, Naval Ocean Systems Center, San Diego, CA,
September,  1981. 

Abstract 

Large  multiprocessing  and  distributed  processing  computer  systems  suffer  from  diminishing  returns  in  system 
performance  as  additional  processors  are  added.  The  slow  execution  speed  of  executive  software  is  one  of  the 
principal  causes  of  this  phenomenon.  The  purpose  of  the  executive  software  is  to  regulate  the  time  when  the 
various  application  programs  gain  access  to  the  computer  system  resources.  This  task  investigated  the  potential  of 
special-purpose  hardware  to  eliminate  the  execution-speed  bottlenecks  within  executive  software.  A  unit,  named 
the  Hardware  Executive,  was  designed  and  fabricated.  The  Navy  standard  SDEX/M  executive  was  used  as  a 
model.  Algorithms  were  developed  for  the  executive  functions  of  task  creation,  task  dispatching,  intratask 
coordination, real-time clock management, and event-to-task registration and translation.


[Wittie 80] Wittie, L. D.

A  Distributed  Operating  System  for  a  Reconfigurable  Network  Computer. 
IEEE  Transactions  on  Computers,  1980. 


[Wittie  82]  Wittie,  L.  D.;  Fischer,  D.  M. 

The  Design  of  a  Portable  Distributed  Operating  System. 

In Proceedings of the Fifteenth Hawaii International Conference on System
Sciences Vol. 1 (Honolulu HI), pages 324-332. University of Hawaii, University of
Southwestern  Louisiana,  January  6-8, 1982. 

Abstract 

MICROS  is  the  distributed  operating  system  for  MICRONET,  a  reconfigurable  network  of  sixteen  loosely-coupled 
LSI-11s each connected by a packet-switching front end to two of many high-speed busses. MICROS allows many
users to each run multicomputer programs controlled by UNIX-like commands. MICROS consists of both local and
global  system  modules.  The  same  local  modules  are  resident  in  each  node  to  load  task  code  and  to  pass 
messages.  Global  operating  system  tasks  are  dynamically  loaded  into  selected  nodes  and  cooperate  to  manage 
network  resources  in  successively  more  global  nested  subtrees.  MICROS  will  eventually  include  initialization 
routines  to  select  a  virtual  tree  of  resource  management  nodes  within  arbitrarily  connected  networks  of  thousands 
of  nodes.  A  new  version  of  MICROS  with  tools  for  developing  and  debugging  large  distributed  application 
programs is being coded in MODULA-2.

[Zhongxiu  83]  Zhongxiu,  S.;  Du,  Z.;  Peigen,  Y. 

ZCZOS: A Distributed Operating System for a LSI-11 Microcomputer Network.
Operating  Systems  Review  17(3):30-34,  July,  1983. 

Abstract 

Presents ZCZOS, the operating system for the ZCZ distributed microcomputer system. The system may be
constructed by any number of LSI-11 microcomputers in any structure, although for the time being the authors
have only 5 machines connected in a tree structure. They have designed the ZCZ system for investigating
distributed  programming  as  well  as  for  teaching.  It  is  hoped  that  the  system  may  work  as  a  multiuser  time-sharing 
system  with  the  advantages  of  extensibility  and  robustness. 


A.2. Interprocess Communication

[Akkoyunlu 72] Akkoyunlu, Eralp A., Arthur J. Bernstein and Richard E. Schantz.

An  Operating  System  for  a  Network  Environment. 

In  Proceedings,  Symposium  on  Computer-Communications  Networks  and 
Teletraffic, pages 529-538. Polytechnic Institute of Brooklyn, April, 1972.

Abstract 

The  design  of  an  operating  system  for  a  network  environment  is  given.  Processes  in  the  system  utilize  the  same 
set  of  primitives  for  communicating  with  files,  devices,  or  other  processes.  This  permits  uniform  access  to  files 
regardless  of  their  physical  location  in  the  network.  This  system  has  a  modular  structure  similar  to  that  developed 
by Dijkstra [1,2].

[Akkoyunlu 74] Akkoyunlu, Eralp A., Arthur J. Bernstein and Richard E. Schantz.

Interprocess  Communication  Facilities  for  Network  Operating  Systems. 

Computer  7(6):46-55,  June,  1974. 

Abstract 

The  connection  of  several  computers  into  a  network  poses  new  problems  for  the  operating  system  designer.  In 
order  to  appreciate  these  problems  fully,  it  is  useful  to  look  briefly  at  networks  from  the  point  of  view  of  their  goals, 
their  possible  configurations,  and  their  level  of  integration. 

The term "computer network" refers not only to the hardware connection between several computers, but also to
the  software  mechanisms  for  orderly  interaction  between  these  machines.  This  communication  facility  is  the 
crucial  factor  in  networks.  Typical  objectives  in  connecting  computers  into  a  network  are  load  sharing,  hardware 
resource  sharing,  and  software  resource  sharing. 

[Akkoyunlu 75] Akkoyunlu, Eralp A.

On  the  Limitations  of  Acknowledgment  Messages. 

In Proceedings, SIGCOMM-SIGOPS Interface Workshop on Interprocess
Communications, pages 37-39. ACM, March, 1975.

Abstract 

An  important  decision,  made  early  in  the  design  of  an  interprocess  communication  (IPC)  facility,  is  the  amount  of 
information  the  system  undertakes  to  provide  the  sender  of  a  message  on  the  final  disposition  of  it.  From  the  point 
of  view  of  the  user,  the  sender  should  ideally  be  supplied  with  enough  status  information  to  allow  him  to  distinguish 
at  least  between  the  following  possibilities, 

1. the message reached its destination,

2.  the  intended  receiver  is  not  currently  in  the  system, 

3.  there  was  a  transmission  error, 

4. the message got timed out (either the destination process itself or the transmission channel was too
busy to handle the message within a specific time limit),

since  each  of  these  alternatives  would  suggest  a  different  course  of  action, 

1.  go  on, 

2.  give  up, 

3. try again right away,

4. re-transmit, perhaps later - meanwhile do something else.

If  the  system  being  designed  has  a  high  degree  of  centralized  control  (as  when  the  appearance  of  parallel 
processing is created by multiplexing a single processor), this type of support is fairly easy to provide with very
little  loss  in  the  elegance  of  the  design,  so  that  there  is  no  problem. 
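
The pairing of message dispositions with suggested courses of action can be sketched as a small lookup. This is only an illustration in Python, not code from the cited paper: the names `Disposition`, `Action`, `SUGGESTED_ACTION`, and `next_action` are hypothetical, and the sketch assumes the third action is an immediate retry and the fourth a deferred retransmission.

```python
from enum import Enum, auto


class Disposition(Enum):
    """Final disposition of a message, as reported back to the sender."""
    DELIVERED = auto()           # 1. the message reached its destination
    RECEIVER_ABSENT = auto()     # 2. the intended receiver is not in the system
    TRANSMISSION_ERROR = auto()  # 3. there was a transmission error
    TIMED_OUT = auto()           # 4. destination or channel was too busy


class Action(Enum):
    """Course of action the sender might take (hypothetical names)."""
    GO_ON = auto()
    GIVE_UP = auto()
    RETRY_NOW = auto()
    RETRY_LATER = auto()


# Assumed one-to-one pairing of the two numbered lists in the abstract.
SUGGESTED_ACTION = {
    Disposition.DELIVERED: Action.GO_ON,
    Disposition.RECEIVER_ABSENT: Action.GIVE_UP,
    Disposition.TRANSMISSION_ERROR: Action.RETRY_NOW,
    Disposition.TIMED_OUT: Action.RETRY_LATER,
}


def next_action(status: Disposition) -> Action:
    """Look up the action suggested by a reported disposition."""
    return SUGGESTED_ACTION[status]
```

A sender can only choose among these actions if the IPC facility reports which disposition occurred; a facility that cannot tell a transmission error from a timeout collapses two of the table's rows into one.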

[Ball  76]  Ball,  J.  Eugene,  Jerome  Feldman,  James  R.  Low,  Richard  Rashid  and  Paul  Rovner. 

RIG,  Rochester’s  Intelligent  Gateway:  System  Overview. 

IEEE Transactions on Software Engineering SE-2(4):321-328, December, 1976.

Abstract 

Rochester's  Intelligent  Gateway  (RIG)  system  provides  convenient  access  to  a  wide  range  of  computing  facilities. 
The  system  includes  five  large  minicomputers  in  a  very  fast  internal  network,  disk  and  tape  storage,  a 
printer/plotter and a number of display terminals. These are connected to larger campus machines (IBM 360/65
and DEC KL10) and to the ARPANET. The operating system and other software support for such a system present
some  interesting  design  problems.  This  paper  contains  a  high-level  technical  discussion  of  the  software  designs, 
many  of  which  will  be  treated  in  more  detail  in  subsequent  reports. 

[Ball 79a] Ball, J. Eugene, Edward J. Burke, Ilya Gertner, Keith A. Lantz and Richard
F. Rashid.

Perspectives on Message-Based Distributed Computing.

In  Proceedings,  Computer  Networking  Symposium,  pages  46-51.  IEEE,  1979. 

Abstract 

At  the  University  of  Rochester  we  have  had  five  years  of  experience  in  the  design  and  implementation  of  a  multiple 
machine,  multiple  network  system  called  RIG.  The  design  of  RIG  is  based  on  a  model  of  distributed  computation 
-  independent  processes  communicating  only  by  messages  -  which  allows  programmers  to  ignore  the  details  of 
network  and  system  configuration.  This  paper  describes  those  aspects  of  the  RIG  design  which  make  this 
isolation  from  network  realities  possible.  In  addition,  we  describe  the  styles  of  message  communication  which 
have  evolved  in  RIG. 

[Ball  79b]  Ball,  J.  Eugene,  J.  R.  Low  and  G.  J.  Williams. 

Preliminary  ZENO  Language  Description. 

ACM  -  SIGPLAN  Notices  14(9):17-34,  September,  1979. 

Abstract 

The  specification  of  ZENO,  a  programming  language  intended  as  the  target  language  for  a  research  project  in 
advanced  compiling,  is  presented.  The  language  is  strongly  based  on  EUCLID,  with  modifications  for  message- 
based  parallel  processing,  and  a  somewhat  different  treatment  of  data  types. 

[Balzer 71] Balzer, R. M.

PORTS - A Method for Dynamic Interprogram Communication and Job Control.

In Proceedings, National Computer Conference, pages 485-489. AFIPS, May, 1971.

Abstract 

Without  communication  mechanisms,  a  program  is  useless.  It  can  neither  obtain  data  for  processing  nor  make  its 
results  available.  Thus  every  programming  language  has  contained  communication  mechanisms.  These 
mechanisms  have  traditionally  been  separated  into  five  categories  based  on  the  entity  with  which  communication 
is  established.  The  five  entities  with  which  programs  can  communicate  are  physical  devices  (such  as  printers, 
card  readers,  etc.),  terminals  (although  a  physical  device,  they  have  usually  been  treated  separately),  files,  other 
programs,  and  the  monitor.  Corresponding  to  each  of  these  categories  are  one  or  more  communication 
mechanisms,  some  of  which  may  be  shared  with  other  categories. 

[Banino  80]  Banino,  Jean-Serge,  Alain  Caristan,  Marc  Guillemont,  Gerard  Morisset  and  Hubert 
Zimmermann. 

Chorus:  An  Architecture  for  Distributed  Systems. 

Technical  Report  42,  Institut  National  de  Recherche  en  Informatique  et  en 
Automatique  (INRIA),  November,  1980. 

Abstract 

The  CHORUS  project  deals  with  distributed  systems;  more  precisely,  it  investigates  the  impact  of  distribution  on 
operating  systems  and  on  execution  of  applications.  This  report  is  the  result  of  the  first  step  in  this  work.  It 
presents  successively: 

•  a  synthesis  of  the  main  advantages  and  constraints  of  distribution, 

• a model for the execution of a distributed application, where communication, synchronization, control,
etc., is based on the exchange of messages,

•  a  model  for  the  construction  of  a  distributed  application  which  permits  to  turn  distribution  to  the  best 
account, 

• examples which illustrate various aspects of the architecture.

This  report  presents  also  the  minimal  functions  required  from  a  kernel  of  operating  system  in  order  to  support 
execution  of  such  distributed  applications. 


[Barter 78] Barter, C. J.

Communications  Between  Sequential  Processes. 

Technical  Report  34,  Department  of  Computer  Science,  University  of  Rochester, 
November,  1978. 

Abstract 

In  this  paper  we  consider  programs  which  are  designed  and  specified  as  systems  of  sequential  processes, 
communicating with each other explicitly, by passing messages. Of central importance in such systems is the way
in which communication paths or connections are specified: we particularly wish to point to the works of Feldman
(77), Hoare (77) and Milne and Milner (77), by way of contrast with each other and with the present work. We wish
to make two specific proposals concerning the specification of inter-process communication: the first defines the
"construction" of a message as the determining attribute of message passing, the second gives a communications
significance  to  the  structure  of  hierarchies  of  processes.  We  present  these  proposals  within  the  framework  of  a 
small  language. 

1.  Message  passing  is  the  sole  means  of  inter-process  communication,  thereby  excluding 
communication  via  common  data  or  global  variables. 

2.  For  the  specification  of  sequential  processes,  we  adopt  the  guarded  command  notation  of  Dijkstra 
(75),  together  with  Hoare's  (77)  extension  of  that  notation  to  include  the  possibility  of  an  "input 
command"  as  part  of  a  guard.  This  extension  greatly  enhances  the  guarded  command  notation  in  a 
multi-process  situation  (see  later). 

3.  We  assume  asynchronous,  buffered  communication,  and  a  few  convenient  operations  which  allow 
messages  to  be  treated  as  record-like  data  objects  (Feldman  (77)). 

[Baskett  77]  Baskett,  Forest,  John  H.  Howard  and  John  T.  Montague. 

Task  Communication  in  DEMOS. 

In Proceedings, Sixth Symposium on Operating Systems Principles, pages 23-31.
ACM,  November,  1977. 

Abstract 

This  paper  describes  the  fundamentals  and  some  of  the  details  of  task  communication  in  DEMOS,  the  operating 
system  for  the  CRAY-1  computer  being  developed  at  the  Los  Alamos  Scientific  Laboratory.  The  communication 
mechanism is a message system with several novel features. Messages are sent from one task to another over
links.  Links  are  the  primary  protected  objects  in  the  system;  they  provide  both  message  paths  and  optional  data 
sharing  between  tasks.  They  can  be  used  to  represent  other  objects  with  capability-like  access  controls.  Links 
point  to  the  tasks  that  created  them.  A  task  that  creates  a  link  determines  its  contents  and  possibly  restricts  its 
use.  A  link  may  be  passed  from  one  task  to  another  along  with  a  message  sent  over  some  other  link  subject  to  the 
restrictions  imposed  by  the  creator  of  the  link  being  passed.  The  link  based  message  and  data  sharing  system  is 
an  attractive  alternative  to  the  semaphore  or  monitor  type  of  shared  variable  based  operating  system  on  machines 
with  only  very  simple  memory  protection  mechanisms  or  on  machines  connected  together  in  a  network. 
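
The link behaviour described in this abstract (a link points back to its creating task, the creator may restrict its use, and a link may travel with a message sent over another link) can be sketched as follows. This is a rough Python illustration, not the DEMOS interface: `Task`, `Link`, `send`, and the `may_pass` flag are hypothetical names, and kernel-enforced protection and data sharing are not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """A task with a queue of (message, passed_link) pairs."""
    name: str
    inbox: deque = field(default_factory=deque)

    def create_link(self, contents=None, may_pass=True):
        # The creating task determines the link's contents and restrictions.
        return Link(creator=self, contents=contents, may_pass=may_pass)


@dataclass(frozen=True)
class Link:
    """A message path that points to the task that created it."""
    creator: Task
    contents: object = None
    may_pass: bool = True  # assumed restriction: may this link be handed on?


def send(link, message, passed_link=None):
    """Deliver a message over `link`, optionally passing another link along."""
    if passed_link is not None and not passed_link.may_pass:
        raise PermissionError("creator restricted this link from being passed")
    link.creator.inbox.append((message, passed_link))
```

In this sketch a client would hold a link created by a server, send a request over it together with a reply link of its own, and the server would answer by sending over the reply link it received.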

[Bernstein  75]  Bernstein,  Arthur  J.  and  K.  Ekanadham. 

Inter-Process  Communication  in  a  Network. 

Infotech State of the Art Reports (24):415-435, 1975.

Abstract 

The  recent  trend  in  operating  system  development  has  been  increasingly  towards  large  and  complex  systems.  The 
introduction  of  computer  networks  has  only  served  to  compound  the  problem.  Unfortunately,  this  complexity  has 
brought  with  it  a  number  of  serious  problems.  The  cost  of  building  such  systems  is  enormous.  Development  time 
is  long  and  unpredictable,  system  modification  is  difficult  and  the  software  is  never  completely  debugged. 

In  order  to  overcome  these  difficulties  some  systems  have  been  constructed  in  a  modular  fashion.  The  code  to 
perform  a  particular  function  is  localized  to  a  single  module  and  functions  are  chosen  so  that  a  minimum  amount 
of  information  must  be  passed  across  module  boundaries.  Strict  conventions  are  established  concerning  the 
procedure for entering the module and the mechanism for passing information between modules. This approach
parallels techniques that have been used for many years in the development of computer hardware.

[Boebert  78a]  Boebert,  W.  Earl. 

The  HXDP  Executive  Interim  Report. 

Technical  Report  78SRC53,  Honeywell  Systems  &  Research  Center,  June,  1978. 

Abstract 

This interim report presents the results of the first phase of the HXDP executive project.
The activities of this phase were primarily conceptual and speculative. They resulted in the concepts and facilities
of  the  executive,  as  well  as  a  collection  of  conclusions  and  observations  on  the  nature  of  software  in  the  HXDP 
environment.  The  report  gives  the  background  of  the  HXDP  executive  project,  describes  some  of  the  management 
and  procedural  practices  used,  and  lists  the  lessons  learned  about  research  or  advanced  development  projects  in 
the  software  area.  The  report  presents  the  resulting  concepts  and  facilities,  followed  by  a  rationale.  This  section 
traces,  as  completely  as  the  project  team  can  recall,  the  influence  of  the  various  objectives  and  constraints  on  the 
final  concepts  and  facilities  of  the  executive;  it  will  be  the  principal  section  of  interest  to  students  of  the  design 
process. 

The  report  concludes  with  some  general  observations  on  the  nature  of  software  for  distributed  systems  and  areas 
for  future  research. 

[Boebert  78b]  Boebert,  W.  Earl. 

Concepts  and  Facilities  of  the  HXDP  Executive. 

Technical  Report  78SRC21,  Honeywell  Systems  &  Research  Center,  March,  1978. 

Abstract 

This document presents the Concepts and Facilities of the HXDP Executive. The term "Concept" in this document
refers  to  the  abstractions  an  application  programmer  uses  to  visualize  his  application,  describe  it  to  other 
programmers,  and  discuss  options  of  design;  "Facilities"  are  the  external  manifestations  of  the  Executive 
mechanisms  which  implement  the  concepts.  The  document  therefore  explicitly  presents  the  two  aspects  of  any 
general  purpose  system:  the  functions  it  provides  and  the  viewpoint  which  is  imposed  or  encouraged  by  their  use. 

[Boebert  78c]  Boebert,  W.  Earl,  William  R.  Franta,  E.  Douglas  Jensen  and  Richard  Y.  Kain. 
Decentralized  Executive  Control  in  Distributed  Computer  Systems. 

In  Proceedings,  COMPSAC  78,  pages  254-258.  IEEE,  November,  1978. 

Abstract 

This  paper  discusses  the  issues  involved  in  building  a  real-time  control  system  using  a  message-directed 
distributed  architecture.  We  begin  with  a  discussion  of  the  nature  of  real-time  software,  including  the  viability  of 
using  hierarchical  models  to  organize  the  software.  Next  we  discuss  some  realistic  design  objectives  for  a 
distributed  real-time  system  including  fault  isolation,  independent  module  verification,  context  independence, 
decentralized  control  and  partitioned  system  state.  We  conclude  with  some  observations  concerning  the  general 
nature  of  distributed  system  software. 

[Boebert  78d]  Boebert,  W.  Earl,  William  R.  Franta,  E.  Douglas  Jensen  and  Richard  Y.  Kain. 

Kernel  Primitives  of  the  HXDP  Executive. 

In  Proceedings,  COMPSAC  78,  pages  595-600.  IEEE,  November,  1978. 

Abstract 

This  paper  describes  the  kernel  of  an  Executive  being  implemented  for  the  Honeywell  Experimental  Distributed 
Processor  (HXDP)  -  a  vehicle  for  research  in  distributed  computers  for  real-time  control.  The  kernel  provides 
message  transmission  primitives  for  use  by  application  programs  or  higher  level  executive  functions.  In  the  paper 
we describe the message transmission primitives provided by the kernel and the rationale for their selection
based  upon  the  objectives  and  constraints  described  in  a  companion  paper. 

[Boebert  80]  Boebert,  W.  Earl,  Dennis  T.  Cornhill,  William  R.  Franta,  E.  Douglas  Jensen  and 
Richard  Y.  Kain. 

Communications  in  the  HXDP  Executive:  Design  Issues  and  Kernel  Primitives, 
[possibly  unpublished]. 

Abstract 

This  paper  describes  the  kernel  of  an  Executive  for  the  Honeywell  Experimental  Distributed  Processor  (HXDP)  -  a 
vehicle  for  research  in  distributed  computers  for  real-time  control.  The  kernel  provides  message  transmission 
primitives  for  use  by  application  programs  or  higher  level  executive  functions. 

We begin with a discussion of the nature of real-time software, including the viability of using hierarchical models
to organize the software. Next we discuss some realistic design objectives for a distributed real-time system
including fault isolation, independent module verification, context independence, decentralized control and
partitioned  system  state.  We  describe  the  message  transmission  primitives  provided  by  the  kernel  and  the 
rationale  for  their  selection  based  upon  the  objectives  and  constraints.  We  conclude  with  some  observations 
concerning  the  general  nature  of  distributed  system  software. 
[Boebert 7?] Boebert, W. Earl, William R. Franta, E. Douglas Jensen and Richard Y. Kain.

The  HXDP  Executive:  Design  Issues  and  Kernel  Primitives. 

[possibly  unpublished]. 

Abstract 

This  paper  describes  the  kernel  of  an  Executive  being  implemented  for  the  Honeywell  Experimental  Distributed 
Processor (HXDP) - a vehicle for research in distributed computers for real-time control. The kernel provides
message  transmission  primitives  for  use  by  application  programs  or  higher  level  executive  functions.  In  this  paper 
we  describe  some  objectives  and  constraints  of  real-time  control  systems,  the  message  transmission  primitives 
provided  by  the  kernel,  and  the  rationale  for  this  kernel  design,  based  upon  the  objectives  and  constraints.  We 
conclude  with  some  general  observations  on  the  nature  of  distributed  software. 

[Bos 81] Bos, Jan van den, Rinus Plasmeijer and Jan Stroet.

Process  Communication  Based  on  Input  Specifications. 

ACM  Transactions  on  Programming  Languages  and  Systems  3(3):224-250,  July, 
1981. 

Abstract 

Input  tools,  originally  introduced  as  a  language  model  for  interactive  systems  and  based  on  high-level,  input-driven 
objects,  have  been  developed  into  a  model  for  communicating  parallel  processes,  called  the  input  tool  process 
model (ITP). In this model every process contains an input rule, comparable to the right-hand side of a production
rule. This rule specifies in an expression the patterns and sources of input it expects and where the input is to be
handled. The reception of the input triggers action inside the tool process. As part of the action, messages may
be sent to other processes, with destination specified to a varying degree of identification. A potential candidate
for a message is any tool process with the correct type of message slot. Because sending tool processes do not
have to specify completely the identity of receiving tool processes, and vice versa, ITP provides a fully dynamic
communication  model.  Most  communication  aspects  of  other  recently  developed  models  are  contained  in  this 
model.  Synchronization  of  processes  is  accomplished  implicitly  by  the  input  specification;  explicit  synchronization 
constructs  such  as  monitors  and  guarded  regions  can  therefore  be  easily  simulated.  The  ITP  constructs  provide  a 
general  concept  for  interprocess  communication.  Its  application  areas  range  from  interaction  via  process  control 
to  operating  systems.  From  a  programming  point  of  view,  the  language  constructs  offered  are  not  in  any  way 
dependent  on  whether  processes  run  on  single  or  multiple  processors. 

[Brinch  Hansen  77] 

Brinch  Hansen,  Per. 

Network:  A  Multiprocessor  Program. 

In  Proceedings,  Computer  Software  &  Applications  Conference,  IEEE,  November, 
1977. 

[pages]. 

Abstract 

This  paper  explores  the  problems  of  implementing  arbitrary  forms  of  process  communication  on  a  multiprocessor 
network.  It  develops  a  Concurrent  Pascal  program  that  enables  distributed  processes  to  communicate  on  virtual 
channels. The channels cannot deadlock and will deliver all messages within a finite time. The operation, structure,
text, and performance of this program are described. It was written, tested, and described in 2 weeks and worked
immediately. 

[Brinch  Hansen  78] 

Brinch  Hansen,  Per. 

Distributed  Processes:  A  Concurrent  Programming  Concept. 

Communications of the ACM 21(11):934-942, November, 1978.

Abstract 

A  language  concept  for  concurrent  processes  without  common  variables  is  introduced.  These  processes 
communicate  and  synchronize  by  means  of  procedure  calls  and  guarded  regions.  This  concept  is  proposed  for 
real-time  applications  controlled  by  microcomputer  networks  with  distributed  storage.  The  paper  gives  several 
examples  of  distributed  processes  and  shows  how  they  include  procedures,  coroutines,  classes,  monitors, 
processes,  semaphores,  buffers,  path  expressions,  and  input/output  as  special  cases. 
[Britton 80]

Britton, Dianne E. and Mark E. Stickel.

An  Interprocess  Communication  Facility  for  Distributed  Applications. 

In Digest of Papers, COMPCON 80 Fall, pages 590-595. IEEE, 1980.

Abstract 

When  an  application  is  distributed  across  several  processor  nodes,  the  facilities  available  for  communication  and 
synchronization  have  a  tremendous  influence  on  the  ease  with  which  the  application  program  can  be  designed, 
written,  and  understood.  This  paper  presents  a  framework  for  structuring  a  distributed  application  as  a  set  of 
concurrent  processes  and  describes  a  message-based  interprocess  communication  and  synchronization  facility. 
This facility, which is supported in a prototype implementation by a kernel-executive called SUPPOSE, is
particularly  appropriate  for  loosely-coupled  networks  where  common  memory  cannot  be  assumed. 

[Cashin  80]  Cashin,  Peter  M. 

Inter-Process  Communication. 

Technical  Report  8005014,  Bell-Northern  Research,  June,  1980. 

Abstract 

This  report  gives  a  survey  of  both  procedure  oriented  and  message  oriented  inter  process  communication 
techniques.  It  compares  these  techniques  and  discusses  their  use  in  distributed  systems.  The  report  forms  the 
basis for lectures to be given at the NATO Advanced Study Institute on Multiple Processors, Maratea, Italy, June
1980.

The  purpose  of  this  report  is  to  survey  and  compare  many  of  the  different  schemes  used  for  inter  process 
communication,  and  to  draw  out  the  key  issues  for  inter  process  communications  in  distributed  systems.  It  should 
become clear from this survey that inter process communications are far from being well understood; several
significant steps have occurred over the last few years but we are still some way from a widespread consensus
and a well proven set of tools.

[Cheriton  79]  Cheriton,  David  R.,  Michael  A.  Malcolm,  Lawrence  S.  Melen  and  Gary  R.  Sager. 
Thoth,  a  Portable  Real-Time  Operating  System. 

Communications of the ACM 22(2):105-115, February, 1979.

Abstract 

Thoth  is  a  real-time  operating  system  which  is  designed  to  be  portable  over  a  large  set  of  machines.  It  is  currently 
running  on  two  minicomputers  with  quite  different  architectures.  Both  the  system  and  application  programs  which 
use it are written in a high-level language. Because the system is implemented by the same software on different
hardware, it has the same interface to user programs. Hence, application programs which use Thoth are highly
portable. Thoth encourages structuring programs as networks of communicating processes by providing efficient
interprocess  communication  primitives. 

[Cheriton  80]  Cheriton,  David  R. 

A  Loosely  Coupled  I/O  System  for  a  Distributed  Environment. 

In Proceedings, IFIP Working Group 6.4 International Workshop on Local Networks
for Computer Communications, pages 297-318. IBM, August, 1980.

Abstract 

The design of a loosely coupled I/O system is presented that provides a byte-oriented and block-oriented I/O
abstraction in a distributed environment. The design is based on a simple protocol between client processes and
I/O  server  processes.  The  I/O  system  is  loosely  coupled  in  the  sense  that  it  exists  as  a  protocol  or  convention 
among  the  client  processes  and  the  server  processes. 

The  I/O  system  consists  of:  a  library  of  functions  that  implements  the  protocol  in  terms  of  a  set  of  message 
primitives,  a  set  of  participating  I/O  server  processes,  and  an  I/O  server  and  file  identification  scheme  that 
supports  symbolic  naming  of  files.  The  function  library  makes  this  underlying  structure  transparent  to  the 
application  programmer.  The  message  primitives  make  the  protocol  implementation  independent  of  the 
underlying  network  configuration  and  hardware. 

[Chesley 81] Chesley, Harry R. and Bruce V. Hunt.

Squire - A Communications-Oriented Operating System.

Computer  Networks  5(2):333-339, 1981. 

Abstract 

This paper presents the architecture of a communication-oriented, real-time operating system named Squire. The
Squire kernel provides memory management, preemptive multitasking, interprocess communication, and the ability
to manage data outside the process address space, as well as services such as timers. User processes are
protected  from  one  another  by  means  of  restrictions  on  what  objects  they  can  access  and  on  the  type  of  access. 
Squire  has  been  designed  to  provide  efficient  communication  between  cooperating  processes,  portability  to  new 
machine  architectures,  and  support  for  multiple  processor  and  distributed  processor  usage.  Protection,  reliability, 
and  robustness  have  been  major  design  goals.  Squire  supports  a  new  kind  of  object  called  chunks,  which  exist 
outside  the  process  address  space,  and  can  be  used  to  store  and  manage  data.  Squire  also  supports  a  means  for 
extending  the  kernel  in  a  controlled  manner;  this  mechanism  is  used  both  to  implement  such  traditional  functions 
as  device  drivers  and  to  provide  extended  kernel  services  not  present  in  the  basic  Squire  kernel. 

[Chesson 80] Chesson, G. L. and A. G. Fraser.

Datakit  Software  Architecture. 

In Digest of Papers, COMPCON 80 Spring, pages 59-61. IEEE, 1980.

Abstract 

Datakit  packet  switching  and  data  transmission  modules  provide  a  local  area  networking  capability  for  a  range  of 
applications  and  traffic  types.  The  extent  to  which  communication  facilities  of  this  kind  can  be  utilized,  extended, 
and maintained strongly depends on the nature of the related software environment. The software evolving with
Datakit represents a step toward a set of general-purpose software building blocks that can be used with different
communication hardware, different computers, and, to some degree, with different operating systems.

[Clark 82a] Clark, David D.

Name,  Addresses,  Ports,  and  Routes. 

[RFC814]. 

Abstract 

It  has  been  said  that  the  principal  function  of  an  operating  system  is  to  define  a  number  of  different  names  for  the 
same  object  so  that  it  can  busy  itself  keeping  track  of  the  relationship  between  all  of  the  different  names.  Network 
protocols seem to have somewhat the same characteristics. In TCP/IP, there are several ways of referring to
things. At the human visible interface, there are character string 'names' to identify networks, hosts, and
services. Host names are translated into network 'addresses', 32-bit values that identify the network to which a
host is attached, and the location of the host on that net. Service names are translated into a 'port identifier',
which in TCP is a 16-bit value. Finally, addresses are translated into 'routes', which are the sequence of steps a
packet must take to reach the specified addresses. Routes show up explicitly in the form of the internet routing
options,  and  also  implicitly  in  the  address  to  route  translation  tables  which  all  hosts  and  gateways  maintain. 

This  RFC  gives  suggestions  and  guidance  for  the  design  of  the  tables  and  algorithms  necessary  to  keep  track  of 
these  various  sorts  of  identifiers  inside  a  host  implementation  of  TCP/IP. 
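
The chain of translations described in this abstract (host name to address, service name to port, address to route) can be illustrated with a minimal Python sketch. The table contents, the host names, and the function name resolve are hypothetical, and the prefix-based route lookup is a modern simplification chosen for brevity; none of it is taken from the RFC itself.

```python
import ipaddress

# Hypothetical tables; every name and address here is illustrative only.
HOST_NAMES = {"alpha": "10.0.0.5", "beta": "192.0.2.9"}
SERVICE_PORTS = {"telnet": 23, "ftp": 21}
# Address-to-route table: destination network -> next hop label.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "direct",
    ipaddress.ip_network("0.0.0.0/0"): "gateway-1",
}


def resolve(host, service):
    """Translate (host name, service name) into (address, port, route)."""
    address = ipaddress.ip_address(HOST_NAMES[host])  # name -> 32-bit address
    port = SERVICE_PORTS[service]                     # service -> 16-bit port
    # Choose the most specific network that contains the address.
    net = max((n for n in ROUTES if address in n), key=lambda n: n.prefixlen)
    return int(address), port, ROUTES[net]


print(resolve("alpha", "telnet"))
print(resolve("beta", "ftp"))
```

Keeping the three tables separate mirrors the three kinds of identifier the abstract distinguishes; a real host implementation would also have to handle missing entries and table updates.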

[Clark  82b]  Clark,  David  D. 

Modularity  and  Efficiency  in  Protocol  Implementation. 

[RFC817]. 

Abstract 

Many protocol implementers have made the unpleasant discovery that their packages do not run quite as fast as
they  had  hoped.  The  blame  for  this  widely  observed  problem  has  been  attributed  to  a  variety  of  causes,  ranging 
from  details  in  the  design  of  the  protocol  to  the  underlying  structure  of  the  host  operating  system.  This  RFC  will 
discuss  some  of  the  commonly  encountered  reasons  why  protocol  implementations  seem  to  run  slowly. 

Experience  suggests  that  one  of  the  most  important  factors  in  determining  the  performance  of  an  implementation 
is the manner in which that implementation is modularized and integrated into the host operating system. For this
reason,  it  is  useful  to  discuss  the  question  of  how  an  implementation  is  structured  at  the  same  time  that  we 
consider  how  it  will  perform.  In  fact,  this  RFC  will  argue  that  modularity  is  one  of  the  chief  villains  in  attempting  to 
obtain  good  performance,  so  that  the  designer  is  faced  with  a  delicate  and  inevitable  tradeoff  between  good 
structure  and  good  performance.  Further,  the  single  factor  which  most  strongly  determines  how  well  this  conflict 
can  be  resolved  is  not  the  protocol  but  the  operating  system. 

[Collier  72]  Collier,  W.  W.  and  P.  H.  Gum. 

Wait-Free Interprocess Communication Mechanisms.

IBM  Systems  Journal  14(12),  May,  1972. 

Abstract 

The following series of programmable routines allow one specific process (i.e., program) in a computer system to
send an indefinite number of messages to exactly one other process. No message may be lost or received out of
order. Once the sender has completed sending a message, the receiver must be able to receive the message (this
rules out the case in which the sender tries, but fails, to send a message and so it tries again when it next sends
another message). Each process must operate in such a fashion that the other process cannot tell that the first
process  is  active.  In  particular,  neither  process  can  wait  for  the  other  to  become  inactive,  such  as  in  a 
multiprogrammed  computer  system. 
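
The requirements listed in this abstract (one sender, one receiver, no loss, no reordering, no waiting on the other side) can be illustrated with a minimal Python sketch of a single-sender, single-receiver linked queue. This is an assumption-laden illustration, not the routines from the paper: the class and method names are hypothetical, and the sketch relies on each side writing only its own pointer.

```python
class _Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class OneWayChannel:
    """Hypothetical one-sender, one-receiver queue in which neither side waits."""

    def __init__(self):
        dummy = _Node()
        self._head = dummy  # advanced only by the receiver
        self._tail = dummy  # advanced only by the sender

    def send(self, value):
        """Always succeeds: the message is visible once the link is stored."""
        node = _Node(value)
        self._tail.next = node  # a single store publishes the message
        self._tail = node

    def receive(self):
        """Return (True, value), or (False, None) at once if nothing is queued."""
        nxt = self._head.next
        if nxt is None:
            return False, None
        self._head = nxt
        return True, nxt.value


ch = OneWayChannel()
ch.send("first")
ch.send("second")
print(ch.receive(), ch.receive(), ch.receive())
```

Because the queue is unbounded, a send never fails, which matches the requirement that a completed send can always be received; the cost is that storage grows if the receiver falls behind.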

[Cook  80]  Cook,  Robert  P. 

The  Starmod  Distributed  Programming  System. 

In  Digest  of  Papers,  COMPCON  80  Fall,  pages  729-735.  IEEE,  1980. 

Abstract 

Distributed  programming  is  characterized  by  high  communication  costs  and  the  absence  of  shared  variables  and 
procedures as synchronization tools. StarMod is a language, derived from Modula, which is intended for systems
programming in the network environment. The StarMod system attempts to address the problem areas in
distributed programming by creating an environment which is conducive to efficient and reliable network software
construction. The StarMod system will include program packages for compilation, debugging, and software
maintenance  as  well  as  for  performance  evaluation  and  modeling. 

[Cornhill 79] Cornhill, Dennis T. and W. Earl Boebert.
Implementation  of  the  HXDP  Executive. 

[CHI  939]. 

Abstract 

This  paper  describes  a  first  implementation  of  the  executive  for  the  Honeywell  Experimental  Distributed  Processor 
(HXDP). HXDP has been built to investigate distributed, decentralized control in real time applications. The
purpose of the implementation is to demonstrate the utility of, and to gain experience with, the executive primitives
in  the  area  of  interprocess  communication. 

[Cox ??] Cox, George W., William M. Corwin, Konrad K. Lai and Fred J. Pollack.

Interprocess Communication and Processor Dispatching on the Intel 432.
[submitted for publication, 1982].

Abstract

This  paper  describes  a  unified  facility  for  interprocess  communication  and  processor  dispatching  on  the  Intel  432. 
The facility is based on a queuing and binding mechanism called a port. The paper describes our goals and
motivations for ports, both abstract and implementation views of ports, and their absolute and comparative
performance. 

[Dallas  80]  Dallas,  I.  N. 

A  Cambridge  Ring  Local  Area  Network  Realisation  of  a  Transport  Service. 

In Proceedings, IFIP Working Group 6.4 International Workshop on Local Networks
for  Computer  Communications,  pages  271-296.  IBM,  August,  1980. 

Abstract 

A  Network  Independent  Transport  Service  has  been  defined  in  the  United  Kingdom.  From  the  service  description, 
various  protocols  can  be  derived  to  provide  the  service  over  particular  communications  media.  The  paper  gives  a 
brief  description  of  this  Transport  Service  and  goes  on  to  describe  its  realisation,  (encoding),  for  the  Cambridge 
Ring  Local  Area  Network  in  operation  at  the  University  of  Kent.  This  realisation  uses  an  existing  Ring  protocol. 
The conclusions derived from the project are given at the end of the paper.

[Dannenberg 81] Dannenberg, Roger B.

AMPL: Design, Implementation, and Evaluation of a Multiprocessor Language.
Technical Report ?, Computer Science Department, Carnegie-Mellon University,
March,  1981. 

Abstract 

AMPL is an experimental high-level language for expressing parallel algorithms which involve many
interdependent and cooperating tasks. AMPL is a strongly typed language in which all inter-process
communication takes place via message passing. The language has been implemented on the CM*
multiprocessor, and a number of programs have been written to perform numeric and symbolic computation. In
this report, the design decisions relating to process communication primitives are discussed, and AMPL is
compared to several other languages for parallel processing. The implementation of message passing, process
creation, and parallel garbage collection are described. Measurements of several AMPL programs are used to
study  the  effects  of  language  design  decisions  upon  program  performance  and  algorithm  design. 
[Danthine 75] Danthine, André A. S. and Joseph Bremer.

Communication  Protocols  in  a  Network  Context. 

In Proceedings, SIGCOMM-SIGOPS Interface Workshop on Interprocess
Communications,  ACM,  March,  1975. 

Abstract 

If the problem of process cooperation has been intensively studied since 1965 [5], it is much more recently
[7, 12] that the attention has been drawn to the problems associated with the cooperation of distributed processes.
In this case, additional problems arise from the communication support and from the environment disparities in
terms of space, time and function.

In a distributed computer network based on a packet-switched communication subnet, it is possible to describe
the  system  as  a  hierarchical  structure.  We  will  consider  here  only  three  levels: 

• Level 0: communication between nodes of the subnets.

• Level 1: communication between hosts connected to the net.

• Level 2: communication between user processes (subscribers) in different hosts.

At  each  level,  the  communication  is  based  on  a  protocol  and  the  data  structures  to  be  exchanged  are  different. 
For instance the data structure exchanged at level 2 may be a sequential file. As there is no direct support of
communication at this level, it is through the mechanisms of level 1 and 0 that the transfer will take place with data
structure at each level not directly related to the upper level one. The complete definition of the system will
therefore require not only the level 0, 1 and 2 protocols, but also inter-level protocols.

In  the  following,  we  will  concentrate  on  the  level  1  protocol.  It  is  the  basic  communication  protocol  of  a  network 
since it will be used by user processes in different hosts and it will use the subnet as a communication support.
This level 1 entity is called a "TS" (transport station) in CYCLADES [13] and a "TCP" (transmission control
program) in [2].

[Danthine 80] Danthine, André A. S. and F. Magnee.

Transport  Layer  -  Long-Haul  Vs.  Local  Network. 

In Proceedings, IFIP Working Group 6.4 International Workshop on Local
Networks,  pages  271-296.  IBM,  August,  1980. 

Abstract 

A computer network may be considered as a set of cooperating distributed processes organized in a hierarchical
structure. 

[Danthine 81] Danthine, André A. S.

Design  Principles  of  Communication  Protocols. 

In  Data  Communication  and  Computer  Networks,  pages  257-273.  IFIP,  1981. 

Abstract 

The  central  concept  in  a  hierarchical  model  of  a  computer  network  is  the  transport  service.  Besides  providing  the 
processes  with  network  wide  name  space  it  allows  a  connection  oriented  communication.  The  designer  choices 
are  related  to  the  method  to  construct  the  network  wide  name  space,  the  data  elements  on  which  is  based  the 
connection  oriented  communication,  the  existence  of  interrupt  facilities  and  of  lettergram  communication. 

The  process/transport  interface  allows  the  access  not  only  to  the  network  wide  service  but  also  to  the  additional 
services  which  may  be  offered  by  a  station  to  its  local  processes. 

The transport protocol is responsible for achieving the service offered and is based on the expected performances
of the transmission service. The designer choices are related to the connection opening scheme and the data
elements on which the error control and the flow control are based.

[desJardins 75] desJardins, Richard.

Semantic  Notions  for  Interprocess  Communication. 

In Proceedings, SIGCOMM-SIGOPS Interface Workshop on Interprocess
Communications,  pages  159-162.  ACM,  March,  1975. 

Abstract 

It is proposed that a process, operating within a virtual address space (VAS), always communicate with all
processes outside its VAS by a single uniform mechanism. This mechanism, which may be implicit rather than
explicit in a processor with virtual addressing hardware, is a set of generalized I/O primitives, as suggested in
some detail by [Akkoyunlu, Bernstein, and Schantz]. Such a primitive designates a segment (buffer) to be realized
in  the  VAS  of  a  designated  receiving  process.  We  consider  the  problems  of  designating  the  receiver  process  in  the 
sequel,  and  for  now  concentrate  on  the  structure  of  the  sender. 

[Didic 82] Didic, Milena and Bernd Wolfinger.

Simulation  of  a  Local  Computer  Network  Architecture  Applying  a  Unified  Modeling 
System. 

Computer Networks 6(1):75-91, 1982.

Abstract 

A  modeling  system  is  described  which  allows  us  to  determine  qualitative  and  quantitative  interdependences 
between  system  parameters  for  protocol  hierarchies  and  for  various  organizations  of  services  in  computer 
networks.  Its  use  is  shown  for  modeling  a  local  resource  sharing  network  specified  in  terms  of  the  ISO  Reference 
Model  of  Open  Systems  Interconnection.  The  simulation  is  intended  to  help  both  during  the  design  phase  of  the 
network  architecture  in  order  to  find  an  optimum  design  solution  and  subsequently,  after  implementation,  to 
investigate  trade-offs  among  various  network  configurations.  Experimental  results  to  increase  the  efficiency  of  a 
network configuration are given.

[Ekanadham  75]  Ekanadham,  K.  and  Arthur  J.  Bernstein. 

The  Structure  of  Interprocess  Communication. 

In Proceedings, SIGCOMM-SIGOPS Interface Workshop on Interprocess
Communications,  pages  28-30.  ACM,  March,  1975. 

Abstract 

The  purpose  of  this  work  is  to  make  some  general  observations  about  the  structure  of  any  interprocess 
communication mechanism (IPCM). Much of the work in this area, to date, has confined itself to the design of a
specific  IPCM  to  meet  the  needs  of  a  particular  operating  system  environment.  It  is  our  feeling  that  there  are 
certain  basic  principles  which  underlie  the  structure  of  any  IPCM.  An  understanding  of  these  principles  should 
help  to  clarify  various  tradeoffs  and  shed  some  light  on  the  design  process. 

[Elovitz 74] Elovitz, Honey S. and Constance L. Heitmeyer.

What  is  a  Computer  Network? 

In  IEEE  1974  NTC  Record,  pages  1007-1014.  IEEE,  1974. 

Abstract 

A  recent  trend  in  computer  systems  has  been  the  use  of  data  transmission  and  packet  switching  technology  to 
construct what are commonly referred to as 'computer networks'. There is an important, yet often unmentioned,
distinction among such networks, namely between 'computer-communications networks' and 'computer
networks'. In a 'computer-communications network', the user must explicitly manage the computer resources.
In a 'computer network', these resources are managed automatically by a network operating system. Most of the
existing 'computer networks' such as TYMNET and ARPANET are more accurately labeled
'computer-communications networks'.

This paper intends to remove the obscurity from the term 'computer network' by characterizing the differences
between 'computer networks' and 'computer-communications networks'. Several existing networks are
described  and  classified. 

[Enslow  79]  Enslow  Jr.,  Philip  H.  and  Robert  L.  Gordon. 

Interprocess Communication in Highly Distributed Systems -- A Workshop Report.
Technical Report GIT-ICS-79/09, Georgia Institute of Technology, December,
1979.

Abstract 

The  subject  of  the  workshop  is  Interprocess  Communication  Mechanisms  with  a  particular  focus  on  process  to 
process  communications  in  highly  distributed  systems.  Highly  distributed  systems  are  characterized  by  a  very  high 
degree  of  loose-coupling  between  physical  resources  as  well  as  between  logical  resources  plus  dynamic, 
short-term  changes  in  the  topology  and  organization  of  the  total  system.  These  characteristics  place  new 
requirements on the design and performance of IPC mechanisms that are assuming extreme importance in
advancing  the  state-of-the-art  in  all  forms  of  distributed  systems. 


[Farber  72a]  Farber,  David  J.  and  Kenneth  C.  Larson. 

The  System  Architecture  of  the  Distributed  Computer  System--The 
Communications  System. 

In  Proceedings,  Symposium  on  Computer-Communications  Networks  and 
Teletraffic, pages 21-27. Polytechnic Institute of Brooklyn, April, 1972.

Abstract 

The  Distributed  Computing  System  (DCS)  is  an  experimental  computer  network  under  study  at  the  University  of 
California  at  Irvine  under  NSF  funding.  The  network  has  been  designed  with  the  following  goals  in  mind:  reliability, 
low  cost  facilities,  easy  addition  of  new  processing  services,  modest  startup  cost,  and  low  incremental  expansion 
cost. The structure chosen to achieve these goals is a digital communications ring using T1 technology and fixed
message lengths. The computers used are small to medium scale and are interfaced to the ring using a fairly
sophisticated piece of hardware called a Ring Interface (RI).

There are two features which make the communications protocols unique. First, messages are addressed to
processes, not processors. This is accomplished by placing an associative store in each RI. The store contains
the names of all processes active on the attached processor. When a message arrives over the ring, the
destination process name is matched against the associative store. If a match occurs the message is copied and
passed over the ring to the next RI. Second, messages are only removed at the RI from which they originate. The
ring may be thought of as a series of message slots. To transmit a message the RI waits for an empty slot and
places the message on the ring. The message is copied when necessary as it is removed from the ring. If errors
are detected or the message fails to return in a specific amount of time the message is retransmitted. The
retransmission causes problems since RIs may receive multiple copies of the message. The paper describes a
scheme for sequencing messages which removes these problems. Note that this scheme allows messages to be
broadcast to all processes or a class of processes. The DCS/OS software uses this feature extensively. The
paper also discusses the error detection and maintenance features. Basically, each RI has a short circuit which
removes it from the ring while maintaining the ring connectivity. This short circuit can be activated through
internal checks within the RI or externally by specific messages. Redundancy of communication paths in the ring
protects the ring connectivity.
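
The process-addressed delivery described in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the names Message, RingInterface, and circulate are hypothetical, and the duplicate filter based on (origin, sequence number) pairs is only an assumed stand-in for the sequencing scheme, which the abstract does not detail.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    origin: int        # index of the RI that placed the message on the ring
    seq: int           # per-origin sequence number (assumed duplicate filter)
    dest_process: str  # messages name a process, not a processor
    payload: str


@dataclass
class RingInterface:
    index: int
    store: set = field(default_factory=set)       # associative store of local process names
    seen: set = field(default_factory=set)        # (origin, seq) pairs already copied
    delivered: list = field(default_factory=list)

    def on_message(self, msg):
        """Copy the message on a name match; return True if this RI must remove it."""
        if msg.dest_process in self.store and (msg.origin, msg.seq) not in self.seen:
            self.seen.add((msg.origin, msg.seq))
            self.delivered.append((msg.dest_process, msg.payload))
        # Only the originating RI takes the message off the ring.
        return msg.origin == self.index


def circulate(ring, msg):
    """Pass msg from RI to RI, starting after its origin, until the origin removes it."""
    position = (msg.origin + 1) % len(ring)
    hops = 1
    while not ring[position].on_message(msg):
        position = (position + 1) % len(ring)
        hops += 1
    return hops


ring = [RingInterface(i) for i in range(4)]
ring[1].store.add("printer")
ring[3].store.add("printer")   # one name active on two nodes acts like a class broadcast
ring[2].store.add("editor")

msg = Message(origin=0, seq=1, dest_process="printer", payload="job-17")
hops = circulate(ring, msg)
circulate(ring, msg)           # simulated retransmission of the same message
```

Because the sender names a process rather than a node, it needs no knowledge of where "printer" runs, and the retransmitted copy is ignored by RIs that have already recorded the same (origin, seq) pair.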

[Farber  72b]  Farber,  David  J.  and  Kenneth  C.  Larson. 

The  Structure  of  a  Distributed  Computing  System-Software. 

In Proceedings, Symposium on Computer-Communications Networks and
Teletraffic, pages 539-544. Polytechnic Institute of Brooklyn, April, 1972.

Abstract 

This  paper  describes  a  software  system  which  allows  the  control  of  a  network  of  small  processors  to  be  distributed 
among the processors on the network. The design goals for the software system are presented, the primary goal
being that the network be fail-soft (Section V). The hardware used to implement the network is then described.
The unique feature of the hardware is a technique of message addressing which allows processes to communicate
with no knowledge of each other's physical location in the network. The next section shows the ways in which the
operating system was shaped by the design goals and describes the interprocess communications scheme and
some of the basic characteristics of the operating system. A more detailed description of the entire operating
system is then presented, in particular showing the ways in which the responsibility for resource allocation and
scheduling  is  distributed  among  the  separate  processors.  The  software  which  maintains  the  network  is  described 
and  examples  of  error  conditions  and  recovery  or  checking  procedures  are  given.  The  future  plans  for  the 
network  are  presented. 

[Farber  76]  Farber,  J.  and  R.  Pickens. 

The  Overseer,  a  Powerful  Communications  Attribute  for  Debugging  and  Security  in 
Thin-Wire  Connected  Control  Structures. 

In  Proceedings,  Third  International  Conference  on  Computer 
Communication, pages 441-451. August, 1976.

Abstract

Thin  wire  communications,  otherwise  known  as  serial  message  sending,  encourages  modularity  in  distributed 
program  design  and  makes  visible  the  interprocess  communications  streams  to  an  unprecedented  degree.  In  this 
paper,  a  powerful  process  monitoring  capability,  the  overseer  function,  is  proposed  to  aid  the  program  developer 
in  guaranteeing  the  dynamic  correctness  of  his  distributed  process  mix.  The  top  down  design  process  is 
overviewed  with  the  emphasis  on  generating  an  analyzable  model  of  the  intra-module  control  structure.  With 
appropriate augmentation of interprocess communications streams it is feasible to endow the communications
with a control sequence validation capability. The need for dynamic changing process contexts is discussed, and
the overseer is shown to be capable of emulating this level of process behavior. Path verification (for protection)
and single channel monitoring (for dynamic probing) are two final attributes which may usefully be part of the
overseer function. Overall the overseer is only a part of a systematized process for distributed system design, but
promises great potential in improving the visibility of dynamic process behavior in distributed systems.

[Feldman  79]  Feldman,  Jerome  A. 

High  Level  Programming  for  Distributed  Computing. 

Communications  of  the  ACM  22(6):353-368,  June,  1979. 

Abstract 

Programming for distributed and other loosely coupled systems is a problem of growing interest. This paper
describes an approach to distributed computing at the level of general purpose programming languages. Based
on primitive notions of module, message, and transaction key, the methodology is shown to be independent of
particular languages and machines. It appears to be useful for programming a wide range of tasks. This is part of an
ambitious program of development in advanced programming languages, and relations with other aspects of the
project are also discussed.

[Fjellheim  79]  Fjellheim,  Roar  A. 

A  Message  Distribution  Technique  and  its  Application  to  Network  Control. 
Software-Practice  and  Experience  9(?):499-505,  June,  1979. 

Abstract 

The  patterns  of  message  exchange  in  distributed  computer  systems  can  become  sufficiently  complex  to  justify  the 
construction  of  communication  services  that  extend  the  basic  message  transmission  mechanism.  A  simple 
method  for  implementing  a  copy  distribution,  or  broadcast,  service  is  described.  It  is  shown  how  the  method  can 
support  command  and  monitoring  functions  in  a  computer  communication  network. 

[Fleisch 81] Fleisch, Brett D.

An  Architecture  for  Pup  Services  on  a  Distributed  Operating  System. 
SIGOPS-Operating  Systems  Review  15(1):26-44,  January,  1981. 

Abstract 

At the University of Rochester the computer science department has had six years of experience in the design and
implementation of a multiple-machine, multiple network distributed system called RIG. Rochester's Intelligent
Gateway (RIG) [1,2,3] is a dual processor gateway which connects three computer networks to provide convenient
access  to  a  wide  range  of  computer  facilities.  RIG  was  built  to  serve  as  an  intermediary  between  the  human  user 
(working  through  a  display  terminal  or  personal  computer)  and  a  variety  of  computer  systems.  The  bulk  of  the 
user's  computational  requirements  is  met  by  these  systems,  which  are  either  partially  integrated  into  the  RIG 
system  through  a  fast  local  network  or  loosely  coupled  to  it  through  the  ARPANET.  RIG  also  provides  a  number  of 
basic  services  such  as  printing,  plotting,  local  file  storage,  and  support  for  a  number  of  display  terminals. 

This  paper  presents  an  architecture  for  Pup  services  on  RIG.  Pup  is  the  name  of  an  internetwork  packet  format 
(PARC Universal Packet), a hierarchy of protocols and a style of internetwork communication [4]. These services
proposed  provide  access  to  Pup  interprocess  communication  primitives  on  a  distributed  operating  system.  The 
motivation  for  this  design  is  twofold.  First,  we  wish  to  develop  a  framework  in  which  processes  may  perform 
network  communication  using  a  wide  variety  of  interprocess  communication  styles,  selectable  by  the  process 
upon  initialization.  These  styles  are  necessary  because  of  the  diversity  of  protocols  in  the  environment.  Moreover, 
this  framework  must  extend  an  environment  that  has  provided  logical  centralization  of  distributed  resources. 
Second,  we  wish  to  integrate  some  new  functions  into  our  message  based  operating  system  which  are  not 
currently provided. Although many services have been provided by RIG, the provision of Pup services will give us
added  flexibility. 

[Folts  80]  Folts,  Harold  C. 

X.25 Transaction-Oriented Features - Datagram and Fast Select.

IEEE  Transactions  on  Communications  COM-28(4):496-500,  April,  1980. 

Abstract 

The  latest  proposed  revisions  to  CCITT  Recommendation  X.25  for  packet-switching  service  in  public  data 
networks  now  include  two  new  capabilities  suitable  for  transport  of  small  amounts  of  data.  The  first  provides 
datagram service for the transport of independent "message type" packets. The other new feature is the fast
select facility which provides for the inclusion of 120 octets of user data in the call establishment packets for virtual
call service. Both these new provisions greatly enhance the capability of X.25 to efficiently support the broadest
range of user applications.


[Ford 76] Ford, W. S. and V. C. Hamacher.

Hardware  Support  for  Inter-Process  Communication  and  Processor  Sharing. 

In  Proceedings,  Third  Annual  Symposium  on  Computer  Architecture,  pages 
113-118.  IEEE,  January,  1976. 

Abstract 

The  abstraction  of  a  computer  system  as  a  set  of  asynchronous  communicating  processes  is  an  important  system 
concept. This paper indicates how the concept could be supported at a low hardware level. A new inter-process
communication mechanism called a mailbox is introduced. Examples of its use as a programming tool are given.
This is followed by a description of hardware features that use this mechanism as the basis of communication
between the components of a complete system. These features include processor-sharing hardware capable of
handling process selection and switching with high efficiency. It is also indicated how these features can take the
place  of  conventional  input/output  structures. 

[Ford  77]  Ford,  W.  S.  and  V.  C.  Hamacher. 

Low  Level  Architecture  Features  for  Supporting  Process  Communication. 

The Computer Journal 20(2):156-162, May, 1977.

British  Computer  Society. 

Abstract 

A  proposal  is  presented  for  low  level  hardware  features  which  would  assist  in  the  realisation  of  the  abstraction  of  a 
computer  system  as  a  set  of  asynchronous  communicating  processes.  A  low  level  synchronisation  and 
communication mechanism, called a mailbox, is described, together with details of a hardware structure for
configuring a complete system around a set of these mailboxes. Programming for this architecture is then
discussed. It is shown how the new features can be used for controlling input/output, and for handling general
synchronization. 

[Forsdick 81] Forsdick, Harry C., William I. MacGregor, Richard E. Schantz, Steven
A. Swernofsky, Robert H. Thomas and Stephen G. Toner.

Distributed  Operating  System  Design  Study:  Final  Report. 

Technical  Report  4674,  Bolt  Beranek  and  Newman  Inc.,  May,  1981. 

Abstract 

A Distributed Operating System (DOS) is made from many interacting parts. The architecture for a DOS is the
organization  and  relationships  between  the  various  components,  programs,  and  protocols  that  make  up  the 
distributed  computer  system.  Specifying  a  basic  architecture  for  a  DOS  serves  several  purposes.  It  provides  an 
integrated  framework  to  which  refinements  in  the  areas  of  our  special  concern  (Global  Resource  Control  and 
Reliability)  may  be  made.  An  explicit  architecture  records  many  implications  of  the  goals  stated  in  the  previous 
Chapter  which  are  system-wide  implications.  Finally,  an  architectural  framework  places  some  boundaries  on 
subsequent  aspects  of  the  emerging  design. 

[Franta  81]  Franta,  William  R.,  E.  Douglas  Jensen,  Richard  V.  Kain  and  George  D.  Marshall. 
Real-Time  Distributed  Computer  Systems. 

Advances  in  Computers  20:39-82,  1981. 

Abstract 

Distributed  computer  systems,  containing  several  computers,  may  provide  increased  system  availability  and 
reliability.  Their  design  is  complex,  involving  the  design  of  communications  mechanisms  in  hardware  and  software 
and the selection of policies and mechanisms for distributed system control. The complex design issues may have
simple solutions in well-understood application environments: the real-time control environment is one such
environment. For these reasons, some early distributed computer system development projects have focused on
the real-time application environment.

In  this  contribution  we  cover  real-time  distributed  computer  systems  from  promise  through  design  and 
implementation. First, we discuss the motivation for distributed computer systems in terms of possible system
characteristics  attained  by  distributing  the  computational  resources  and  then  we  characterize  the  real-time  control 
application  environment.  In  subsequent  sections  we  review  the  options  and  issues  related  to  hardware  and 
software  designs  for  distributed  systems,  and  accompany  the  general  discussions  with  the  details  of  the  design 
and  implementation  of  the  Honeywell  Experimental  Distributed  Computing  (Processor)  system,  known  as  HXDP. 
The HXDP project hardware design began in 1974, was realized in 1976, and system software design and
realization were completed in 1976. Applications experiments are continuing in 1980.


[Galtieri 80] Galtieri, Cesare A.

Architecture  for  a  Consistent  Decentralized  System. 

Technical  Report  36132,  IBM,  June,  1980. 

Abstract 

This  paper  has  three  principal  aims.  First,  to  set  forth  a  definition  of  system  consistency  which  allows  for 
decentralized  systems  and  which  is  as  free  as  possible  of  hidden  implementation  assumptions.  Second,  to 
propose  a  system  architecture  which  guarantees  consistency,  in  the  sense  previously  defined.  Third,  to  outline  an 
implementation  approach  for  a  simple  data  management  function. 

The  general  intent  of  the  proposal  is  to  achieve  a  high  degree  of  independence  and  concurrency  among  the 
various  components  of  a  decentralized  system  consistency.  In  particular  our  approach 

•  allows  for  concurrency  within  a  transaction,  a  capability  which  is  important  for  the  effective  support  of 
complex  transactions  in  a  decentralized  environment; 

•  guarantees maximal apparent concurrency among transactions;

•  facilitates  the  support  of  more  selective  operations  which,  in  most  cases,  transform  apparent 
concurrency  into  real  concurrency. 

[Garlick ??] Garlick, Lawrence L., Raphael Rom and Jonathan B. Postel.

Issues  in  Reliable  Host-to-Host  Protocols. 

Abstract 

Fully  reliable  network  host-to-host  protocols  have  recently  gained  significant  attention,  primarily  due  to  more 
stringent  security  requirements  of  network  users.  This  paper  will  discuss  issues  related  to  one  such  protocol, 
which  is  supported  by  the  Transmission  Control  Program  (TCP).  The  protocol,  first  introduced  in  1974,  features 
end-to-end  positive  acknowledgement  retransmission,  internetwork  addressing  capabilities,  and  ordered  delivery. 

The  issues  of  interest  in  this  paper  are  protocol  correctness  and  completeness,  protocol  efficiency,  and  complexity 
of  implementation.  The  discussion  will  suggest  alterations  and  extensions  to  TCP. 

Flow  control  heuristics  using  TCP's  windowing  techniques  are  explored.  Flow  control  information  is  augmented  to 
allow  fair  apportionment  of  bandwidth,  better  bandwidth  utilization  through  optimistic  credits,  flow  control  credits 
matched to the type of traffic, and increased performance for high precedence connections.

An alternative for selecting the startup sequence number of a connection is presented. It is suggested that the
resynchronization  method  for  sequence  number  space  management  should  be  abandoned  because  it  is  overly 
complicated  and  can  actually  fail  when  the  data  stream  is  stopped  by  flow  control. 

The need for the separation of data and control channels is motivated, introducing the notion of a reliable
subchannel. 

The  findings  are  presented  both  to  further  the  understanding  of  reliable  protocols  and  to  encourage  intelligent 
implementations  of  TCP. 

[Gehringer 81] Gehringer, Edward F. and Robert J. Chansler Jr.

StarOS  User  and  System  Structure  Manual. 

Technical  Report  ?,  Department  of  Computer  Science,  Carnegie-Mellon  University, 
June,  1981. 

[not  released  as  of  Jan.  82]. 

Abstract 

Technological  advances  have  made  it  attractive  to  interconnect  many  less  expensive  processors  and  memories  to 
construct  a  powerful,  cost-effective  computer.  Potential  benefits  include  increased  cost-performance  resulting 
from  the  exploitation  of  many  cheap  processors,  enhanced  reliability  in  the  integrity  of  data  and  in  the  availability 
of  useful  processing  power,  and  a  physically  adaptable  computer  whose  capacity  can  be  expanded  or  reduced  by 
addition  or  removal  of  modular  components.  Realizing  these  potential  benefits  requires  software  structures  that 
make effective use of the hardware. StarOS is a message-based, object-oriented, multiprocessor operating
system, specifically designed to support task forces, large collections of concurrently executing processes that
cooperate to accomplish a single purpose. StarOS has been implemented at Carnegie-Mellon University on the 50
processor Cm* multimicroprocessor computer.


[Gentleman  80]  Gentleman,  W.  Morven  and  J.  E.  Corman. 

Design  Considerations  for  a  Local  Area  Network  Connecting  Diverse  Primitive 
Machines. 

In Proceedings, IFIP Working Group 6.4 International Workshop on Local Networks
for Computer Communications, pages 207-221. IBM, August, 1980.

Abstract 

Local  area  networks  have  typically  been  designed  to  connect  remote  peripherals  and  concentrators  to  host 
machines,  or  to  interconnect  homogeneous  computers  running  a  common  network  operating  system,  or  to 
interconnect substantial self-sufficient computer systems. The objectives for the network we are constructing are
quite  different.  Most  of  the  network  subscribers  will  be  primitive  machines  of  diverse  types.  This  has  implications 
which  strongly  affect  design  decisions  in  the  network  hardware  and  software. 

Firstly, it means the subscriber hardware is inexpensive, so to maintain balance, the network cost per subscriber
must be low, and, in particular, the hardware interface to the network must be inexpensive too.

Secondly,  it  means  that  the  machines  are  of  many  architectures,  so  a  standard  port  to  the  subscriber  computer 
must be used; building custom hardware for each machine type is infeasible.

These  two  imply  the  network  must  present  an  interface  to  a  standard  serial  communications  port,  or  to  a  standard 
parallel  port,  if  such  can  be  defined. 

Third, it means that the operating systems of the subscribers will be quite different; indeed, the same subscriber
may, at different times, run several incompatible systems, and one of the requirements of the network is to be able
to  download  such  systems.  This  implies  the  outer-most  level  of  communications  protocol  must  be  very  simple, 
perhaps  byte-stream  with  preset  virtual  circuits.  More  flexible  protocols  must  be  built  on  top  of  this. 

Fourth,  there  is  no  central  machine:  groups  of  subscribers  can  be  expected  to  communicate  heavily  among 
themselves,  but  rarely  with  others.  Their  higher-level  protocols  should  suit  them.  File  transfer  will  be  the  main 
activity. 

This  paper  discusses  these  and  other  factors,  shows  why  most  existing  network  designs  are  inappropriate  in  this 
context, and then, by describing the network being built at the University of Waterloo, illustrates that suitable designs
are  possible. 

[Giloi 81] Giloi, W. K. and P. Behr.

An  IPC  Protocol  and  its  Hardware  Realization  for  a  High-Speed  Distributed 
Multicomputer  System. 

In  Proceedings,  Eighth  Annual  Symposium  on  Computer  Architecture,  pages 
481-493.  IEEE  and  ACM,  1981. 

Abstract 

Multicomputer systems with distributed control form an architecture that simultaneously satisfies such design goals
as  high  performance  through  parallel  operation  of  VLSI  processors,  modular  extensibility,  fault  tolerance,  and 
system  software  simplification.  The  nodes  of  the  system  may  be  locally  concentrated  or  spatially  dispersed  as  a 
local  network.  Applications  range  from  data  base-oriented  transactional  systems  to  "number  crunching."  The 
system  is  service-oriented;  that  is,  it  appears  to  the  user  as  one  computer  on  which  parallel  processing  takes  place 
in  the  form  of  cooperating  processes.  Cooperation  is  regulated  by  the  unique  interprocess  communication  (IPC) 
protocol  presented  in  this  paper.  The  high-level  protocol  is  based  on  the  consumer/producer  model  and  satisfies 
all requirements for such a distributed multicomputer system. It is demonstrated that the protocol lends itself
toward  a  straightforward  mechanization  by  dedicated  hardware  consisting  of  a  cooperation  handler,  an  address 
transformation  and  memory  guard  unit,  and  bus  connection  logic.  These  special  hardware  resources,  assisted  by 
a  "local  operating  system",  form  the  supervisor  of  a  node.  Nodes  are  connected  by  a  high-speed  bus  (280 
Mbit/sec).  Programming  aspects  as  implied  by  the  protocol  are  also  described. 

[Green  80]  Green  Jr.,  Paul  E. 

An  Introduction  to  Network  Architectures  and  Protocols. 

IEEE Transactions on Communications COM-28(4):413-424, April, 1980.

Abstract 

This tutorial paper is intended for the reader who is unfamiliar with computer networks, to prepare him for reading
the more detailed technical literature on the subject. The approach here is to start with an ordered list of the
functions that any network must provide in tying two end users together, and then to indicate how this leads
naturally to layered peer protocols out of which the architecture of a computer network is constructed. After a
discussion of a few block diagrams of private (commercially provided) and public (common carrier) networks, the
layer and header structures of SNA and DNA architectures and the X.25 interface are briefly described.

[Guillemont  82]  Guillemont,  Marc. 

The  Chorus  Distributed  Operating  System:  Design  and  Implementation. 

In Proceedings, International Symposium on Local Computer Networks, Institut
National de Recherche en Informatique et en Automatique (INRIA), April, 1982.

Abstract 

CHORUS is an architecture for distributed systems. It includes a method for its execution and the (operating)
system  to  support  this  execution.  One  important  characteristic  of  CHORUS  is  that  the  major  part  of  the  system  is 
built  with  the  same  architecture  as  applications.  In  particular,  the  exchange  of  messages,  which  is  the 
fundamental communication/synchronization mechanism, has been extended to the most basic functions of the
system.

[Guillier 80] Guillier, P. and D. Slosberg.

An Architecture with Comprehensive Facilities of Inter-Process Synchronization
and  Communication. 

In  Proceedings,  Seventh  Annual  Symposium  on  Computer  Architecture,  pages 
264-270.  IEEE  and  ACM,  1980. 

Abstract 

In the architecture of the "Level 64" manufactured by CII-Honeywell-Bull and Honeywell Information Systems,
processes executing in a central processor are known to the hardware-firmware. They use the same semaphore
mechanism as processes executing in an input-output controller. This implies specific data structures recognized
by the hardware-firmware and a hardware-firmware dispatching of the central processor resource. Experience in
this  domain  has  led  to  the  development  of  some  new  extensions. 

[Halsall  78]  Halsall,  F.  and  A.  E.  Fenesan. 

Software  Aspects  of  a  Closely  Coupled  Multicomputer  System. 

Computers  and  Digital  Techniques  1(1):21-26,  February,  1978. 

Abstract 

This paper describes the philosophy and structure of the operating-system software which is currently being
developed for a closely coupled multicomputer system. The proposed operating system is effectively distributed
between the individual computing elements of the system. Each computing element or module contains a copy of
a simple operating system or nucleus which has been designed on the one hand to provide a standard software
interface for the applications software within the module and on the other to form an interface with other modules
through the intercomputer-communication facility. A necessary and sufficient condition for a computing module
to function in the proposed system is the possession of a copy of this nucleus. The nucleus software has been
implemented in a high-level procedure-based language and is designed to provide the applications programmer
with a basic set of commands or primitives which facilitate the creation and control of the other application
processes within the same module and the sending and receiving of messages to and from application processes
resident within other modules. The paper also includes details of the size and performance of the implemented
system. 

[Halstead  78]  Halstead  Jr.,  Robert  H. 

Multiple-Processor  Implementations  of  Message-Passing  Systems. 

PhD  thesis,  Laboratory  for  Computer  Science,  Massachusetts  Institute  of 
Technology,  January,  1978. 

Abstract 

The  goal  of  this  thesis  is  to  develop  a  methodology  for  building  networks  of  small  computers  capable  of  the  same 
tasks now performed by single larger computers. Such networks promise to be both easier to scale and more
economical  in  many  instances. 

The mu calculus, a simple syntactic formalism for representing message-passing computations, is presented and
augmented to serve as the semantic basis for programs running on the network. The augmented version includes
cells, tokens, and semaphores, which allow certain simple communications and synchronization tasks without involving
fully general side effects.

The network implementation presented supports object references, keeping track of them by using a new concept,
the reference tree. A reference tree is a group of neighboring processors in the network that share knowledge of a
common object. Also discussed are mechanisms for handling side effects on objects and strategy issues involved
in allocating computations to processors.

[Hammond  80]  Hammond,  Richard  A. 

Experiences with the Series/1 Distributed System.

In  Digest  of  Papers,  COMPCON  80  Fall,  pages  585-589.  IEEE,  1980. 

Abstract 

The Series/1 Distributed System (SODS), developed at the University of Delaware, is an experimental system for
research in distributed computing. It consists of several IBM Series/1 computers, a local communications
network,  an  operating  system  (SODS/OS),  a  file  system  (SODS/FS),  and  applications  software.  Experience  in 
designing,  implementing,  and  using  the  system  has  given  insight  into  the  basic  strengths  and  weaknesses  of  its 
design. 

[Hartenstein  ??]  Hartenstein,  Reiner  W.,  Werner  Konrad  and  Anton  Sauer. 

A  Loosely  Coupled  Multi-Microstructure  as  a  Tool  for  Software  Development, 
[unknown]. 

Abstract 

The  communication  tools  of  message-oriented  operating  systems  and  related  scientific  methods  and  concepts  can 
be  directly  mapped  into  hardware  constructs.  Thus  loosely  coupled  microcomputer  networks  along  with 
approaches  to  specification  and  design  methods  emerge.  It  is  also  shown  which  network  class  is  best  suited  to 
support the reliability of the software and the whole system. After an introductory survey of the basics, a realized
circuit  concept  as  an  application  is  described  together  with  program  development  aids  for  message-coupled 
dedicated  microcomputer  networks.  With  the  help  of  this  tool  special  applications  can  be  implemented  by  means 
of  "distributed  programming"  basing  on  "hardware  capsulated  software  modules".  The  outlined  kit  system 
especially  supports  the  reliability  of  software. 

[Herlihy  80]  Herlihy,  Maurice  and  Barbara  Liskov. 

Communicating  Abstract  Values  in  Messages. 

Technical  Report  200,  Massachusetts  Institute  of  Technology,  October,  1980. 
Computation  Structures  Group  Memo. 

Abstract 

Abstract  data  types  have  proved  to  be  a  useful  technique  for  structuring  systems.  In  large  systems,  however,  it  is 
sometimes  useful  to  have  different  regions  of  the  system  use  different  representations  for  the  abstract  data  values. 
This paper describes a technique for communicating abstract values between such regions. The method was
developed  for  use  in  constructing  distributed  systems,  where  the  regions  exist  at  different  computers,  and  the 
values  are  communicated  over  the  network.  As  such,  the  method  defines  a  call-by-value  semantics.  The  method 
is  also  useful  in  non-distributed  systems  wherever  call-by-value  is  the  desired  semantics.  An  important  example  of 
such  a  use  is  a  repository,  such  as  a  file  system,  for  storing  long-lived  data. 

[Hertweck  78]  Hertweck,  F.,  E.  Raubold  and  F.  Vogt. 

X.25  Based  Process/Process  Communication. 

In Proceedings, Computer Network Protocols, pages C3-1 - C3-22. Université de
Liège, February, 1978.

Abstract 

This paper describes an end-to-end protocol for interprocess communication based on the X.25 virtual channel
protocol. The main (software) device to couple the operating system of a host computer to a communication
network is the "Message Transmission Controller". Its structure and its principal functions are described. Special
consideration is given to implementability on present day computer systems including pure host or simple terminal
MTCs, but also host/front-end configurations. The problem process/process communication is mapped onto an
interface process/MTC by defining a set of suitable interface commands to be executed by the process. The step
to higher level (or application) protocols is done on the basis of the Communication Variable concept.

[Hewitt  77]  Hewitt,  Carl  and  Henry  Baker. 

Laws  for  Communicating  Parallel  Processes. 

In  Proceedings,  Information  Processing  77,  pages  339-344.  IFIP,  1977. 

Abstract 

This paper presents some laws that must be satisfied by computations involving communicating parallel
processes. The laws are stated in the context of the actor theory, a model for distributed parallel computation, and
take the form of stating plausible restrictions on the histories of parallel computations to make them physically
realizable. The laws are justified by appeal to physical intuition and are to be regarded as falsifiable assertions
about the kinds of computations that occur in nature rather than as proven theorems in mathematics. The laws are
used  to  analyze  the  mechanisms  by  which  multiple  processes  can  communicate  to  work  effectively  together  to 
solve  difficult  problems. 

Since  the  causal  relations  among  the  events  in  a  parallel  computation  do  not  specify  a  total  order  on  events,  the 
actor  model  generalizes  the  notion  of  computation  from  a  sequence  of  states  to  a  partial  order  of  events.  The 
interpretation  of  unordered  events  in  this  partial  order  is  that  they  proceed  concurrently.  The  utility  of  partial 
orders  is  demonstrated  by  using  them  to  express  our  laws  for  distributed  computation. 

[Huen  77]  Huen,  Wing,  Peter  Greene,  Ronald  Hochsprung,  Ossama  El-Dessouki. 

A  Network  Computer  for  Distributed  Processing. 

In  Digest  of  Papers,  COMPCON  77  Fall,  pages  326-330.  IEEE,  1977. 

Abstract 

The TECHNEC is a network computer in the form of a ring of microcomputers (LSI-11s), designed for research in
distributed  processing.  The  design  objectives,  architecture  and  software  support  of  the  system  are  presented. 
Major  user  requirements  such  as  pipelined  compiling,  automatic  partitioning,  and  distributed  control  of  machine 
intelligence  applications  are  considered. 

[Hunt  79]  Hunt,  J.  G. 

Messages  in  Typed  Languages. 

ACM-SIGPLAN  Notices  14(1):27-45, 1979. 

Abstract 

Messages  are  increasingly  being  used  for  interprocess  communication.  The  problem  of  introducing  messages  into 
typed  languages  is  considered,  and  a  solution  in  terms  of  typed  message-channels  is  presented.  Our  particular 
treatment  permits  dynamic  connections,  including  secure  linking  of  separately-compiled  programmes,  and  also 
features  nondeterminacy,  thereby  enabling  automatic  resource-scheduling  without  monitors.  Implementation 
considerations  are  discussed,  and  a  comparison  with  the  work  of  other  authors  is  given. 

[Hunt  80]  Hunt,  V.  Bruce  and  Pier  Carlo  Ravasio. 

Olivetti  Local  Network  System  Protocol  Architecture. 

In  Proceedings,  IFIP  Working  Group  6.4  International  Workshop  on  Local  Networks 
for  Computer  Communications,  pages  223-244.  IBM,  August,  1980. 

Abstract 

We  describe  the  Olivetti  Local  Network  System  protocol  architecture,  which  is  a  component  of  the  Olivetti  Local 
Network  System  architecture.  The  ISO  open  system  interconnection  reference  architecture  was  used  as  a 
reference  guide  for  the  architecture.  Our  architecture  provides  the  principal  structures,  attributes  and  component 
interfaces  of  the  communication  system  to  guide  design  and  implementation  of  specific  protocols.  Fundamental 
mechanisms  employed  include  a  communication  model,  layering,  and  functional  division.  The  communication 
model  is  based  on  an  abstract  communication  primitive  called  a  channel.  The  model  is  applicable  at  all  levels  in 
the  hierarchy  of  layers.  The  architecture  defines  six  layers  including  physical  link,  data  link,  transport,  session, 
presentation,  and  application  layers.  Functional  divisions  specified  are  locator,  transport,  synchronization,  error 
management, control and monitoring. Functional division is applied uniformly to each layer to achieve a coherent
overall  structure.  Issues  such  as  performance,  name  recognition,  and  flow  control  arising  from  the  architecture's 
structure  and  associated  implementation  are  discussed. 

[Jacquemart  78]  Jacquemart,  Yves  A. 

Network  Interprocess  Communication  in  an  X.25  Environment. 

In Proceedings, Computer Network Protocols, pages C1-1 - C1-6. Université de
Liège, February, 1978.

Abstract 

In  this  article  we  first  define  an  interprocess  communication  facility  in  terms  of  service,  access  and  function.  After 
clarification  of  the  X.25  principles  we  intend  to  use  X.25  as  the  transmission  service  of  the  interprocess 
communication facility. We compare the X.25 service with other transmission services and we conclude by saying
that a datagram transmission service beside X.25 is necessary.

[Jammel 80] Jammel, Alfons J., Pavel A. Vogel and Helmut G. Stiegler.

Impacts  of  Message  Orientation. 

In  Proceedings,  IFIP  Congress  80,  pages  281-286.  IFIP,  October,  1980. 

Abstract 

Two  different  basic  operating  system  structures,  procedure-  and  message-oriented,  have  been  recognized. 
Strictly message-oriented systems are very rare; their direct distributability, however, strongly suggests that
greater attention should be paid to them. Referring to the operating system BSM we illustrate and discuss the
impact of a strictly message-oriented structure on parallelism, synchronization, protection, distributability, error
recovery,  and  efficiency.  No  inherent  handicaps  due  to  message  orientation  have  been  encountered.  Practical 
experience  with  BSM  has  led  to  the  concept  of  manager  processes.  This  concept  seems  to  have  contributed  some 
new  aspects  to  operating  system  design  by  making  non-sequential  processes  manageable. 

[Jazayeri  80]  Jazayeri,  Mehdi. 

CSP/80:  A  Language  for  Communicating  Sequential  Processes. 

In  Digest  of  Papers,  COMPCON  80  Fall,  pages  736-740.  IEEE,  1980. 

Abstract 

CSP/80  is  a  programming  language  intended  for  distributed  applications.  It  is  based  on  Hoare's  communicating 
sequential  processes  (CSP).  We  discuss  those  aspects  of  CSP/80  that  differ  significantly  from  CSP  and  give  the 
reasons  why  these  departures  from  CSP  were  necessary. 

[Jensen  78]  Jensen,  E.  Douglas. 

The Honeywell Experimental Distributed Processor - An Overview.

Computer 11(1):137-147, January, 1978.

Abstract 

The  Honeywell  Experimental  Distributed  Processor  (HXDP)  is  a  vehicle  for  research  in  the  science  and 
engineering of processor interconnection, executive control, and user software for a certain class of multiple-processor
computers which we call "distributed computer" systems. Such systems are very unconventional in
that they accomplish total system-wide executive control in the absence of any centralized procedure, data, or
hardware. The primary benefits sought by this research are improvements over more conventional architectures
(such as multi-processors and computer networks) in extensibility, integrity, and performance. A fundamental
thesis of the HXDP project is that the benefits and cost-effectiveness of distributed computer systems depend on
the judicious use of hardware to control software costs. In this paper we describe the class of computer systems
of interest to the HXDP project, the motivations for our interest, our research approach, the initial application
environment, the HXDP system philosophy, and the HXDP hardware facilities as seen by the executive
programmer. The software portion of the executive will be described in a subsequent paper.

[Johnson  75]  Johnson,  Paul  R.,  Richard  E.  Schantz  and  Robert  H.  Thomas. 

Interprocess  Communication  to  Support  Distributed  Computing. 

In  Proceedings,  SIGCOMM-SIGOPS  Interface  Workshop  on  Interprocess 
Communications,  pages  199-203.  ACM,  March,  1975. 

Abstract 

A distributed computing system is, by definition, dependent upon communication between the distributed
elements  for  its  existence.  It  has  become  common  to  refer  to  each  instance  of  parallel  activity  in  a  computer 
system  as  a  process.  Therefore,  what  is  known  as  interprocess  communication  (IPC)  is  the  lifeline  or  essential 
building  block  for  any  distributed  computing  facility.  In  the  narrow  sense,  our  concern  with  IPC  is  with  the 
characteristics  of  a  mechanism  and  interface  which  permit  reliable  communication  of  data  between  processes.  In 
a much broader sense, IPC involves not only the facility for transmitting data, but also such questions as what gets
transmitted  and  to  whom,  when  it  gets  transmitted,  what  form  it  takes,  and  how  it  is  used. 

[Jones 79] Jones, Anita K., Robert J. Chansler Jr., Ivor Durham, Karsten Schwan and Steven
R.  Vegdahl. 

StarOS,  a  Multiprocessor  Operating  System  for  the  Support  of  Task  Forces. 

In  Proceedings,  Seventh  Symposium  on  Operating  Systems  Principles,  pages 
117-127.  ACM,  December,  1979. 

Abstract 

StarOS is a message-based, object-oriented, multiprocessor operating system, specifically designed to support
task forces, large collections of concurrently executing processes that cooperate to accomplish a single purpose.
StarOS has been implemented at Carnegie-Mellon University for the 50 processor Cm* multi-microprocessor
computer.  In  this  paper,  we  first  discuss  the  attributes  of  task  force  software  and  of  the  Cm*  architecture.  We 
then  discuss  some  of  the  facilities  in  StarOS  that  allow  development  and  experimentation  with  task  forces.  StarOS 
itself  is  presented  as  an  example  task  force. 

[Joseph 81] Joseph, Mathai.

Schemes  for  Communication. 

Technical Report 122, Department of Computer Science, Carnegie-Mellon
University,  June,  1981. 

Abstract 

This  report  describes  features  of  a  language  for  distributed  and  parallel  programming  which  has  been  designed  to 
provide flexibility in the transfer of information and control between the individual components of a program. The
language  allows  synchronous  and  asynchronous  message-passing,  multiple-source  input  and  broadcast  output, 
and  enables  particular  features  of  a  distributed  architecture  to  be  efficiently  accommodated  without  modification 
to  the  language.  The  module  serves  as  the  unit  of  encapsulation  and  a  single  communication  takes  place  between 
an output port in one module and a set of input ports in other modules: each port has a control rule which specifies
the protocol for sending or receiving messages, and is associated with a particular communication scheme which
implements the communication operations. Modules are assumed to execute independently of each other except
when  they  communicate  by  sending  messages:  the  lifetime  of  a  module  is  therefore  limited  only  by  its  ability  to 
send  and  receive  messages.  The  use  of  the  distinctive  features  of  the  language,  such  as  broadcast  mode  output, 
is  illustrated  with  several  examples. 

[Kain 76] Kain, Richard Y.

Seven  Dimensions  of  Message  Transmission  Protocols. 

[Unpublished  Document]. 

Abstract 

Message  transmission  protocols  differ  according  to  1)  whether  or  not  the  sender  waits  for  an  acknowledgement,  2) 
how the sender addresses the message, 3) how the receiver detects that a message exists, 4) whether the receiver
selects the messages, 5) how the receiver identifies the sender, 6) how the receiver identifies the message, and 7)
who  determines  the  message  lifetime.  In  each  dimension  the  various  options  have  different  advantages.  The 
choice  of  an  option  determines  the  kinds  of  errors  that  can  cause  non-functionality. 

[Kain  80]  Kain,  Richard  Y.  and  William  R.  Franta. 

Interprocess  Communication  Schemes  Supporting  System  Reconfiguration. 

In Proceedings, Computer Software and Application Conference, pages 365-371.
IEEE,  October,  1980. 

Abstract 

Reliability in modular computer systems can be improved by redundancy. At the process level, this requires either
the  creation  of  standby  processes  and  communications  interconnections,  or  the  provision  of  dynamic  recovery 
mechanisms.  This  paper  discusses  the  general  reconfiguration  problem,  suggests  four  system  designs  for  dealing 
with  the  problem,  and  presents  evaluations  of  each  in  terms  of  modularity  and  reliability. 

[Kieburtz 81] Kieburtz, Richard B.

A Distributed Operating System for the Stony Brook Multicomputer.

In  Proceedings,  Second  International  Conference  on  Distributed  Computing 
Systems,  pages  67-79.  IEEE,  1981. 

Abstract 

The Stony Brook Multicomputer is a hierarchically organized network of computer nodes that has been designed to
support  problem-solving  by  decomposition.  High  performance,  relative  to  the  speed  of  its  individual  processors,  is 
one  of  its  primary  design  goals.  This  paper  describes  the  design  of  a  message-based,  distributed,  operating 
system  nucleus  for  the  network.  The  nucleus  of  an  operating  system  provides  an  interface  between  a  physical 
machine  and  higher  levels  of  software  that  implement  abstract  resources  to  be  used  by  applications  programs. 
Thus  it  is  strongly  influenced  by  the  hardware  architecture  of  a  system.  The  design  philosophy  is  to  create  levels 
of  abstract  machines,  and  to  embed  the  necessary  communication  protocols  into  these  abstract  machines.  The 
system supports a hierarchy of distributed file systems, with capability-based protection.


[Knight 81] Knight, Jeremy and Marty Itzkowitz.

THC -- A Simple High-Performance Local Network.

In  Proceedings,  Second  International  Conference  on  Distributed  Computing 
Systems,  pages  354-359.  IEEE,  April,  1981. 

Abstract 

We describe our need for a local network and the reasons we chose HYPERchannel as the hardware with which to
implement it. We then present our reasons for choosing interprocess communication as the principal service of
THC  (The  HYPERchannel  Connection)  and  the  design  choices  made  in  specifying  the  network.  We  then  describe 
the  structure  and  operation  of  the  network.  We  then  go  on  to  describe  the  pseudocode  technique  used  to 
complete  the  design  and  we  briefly  discuss  the  specific  implementations  for  the  various  systems  in  our  network. 
Finally we give performance measurements for the actual implementation and present our conclusions.

[Knott 74] Knott, Gary D.

A  Proposal  for  Certain  Process  Management  and  Intercommunication  Primitives. 
SIGOPS-Operating  Systems  Review  8(4),  October,  1974. 

Abstract 

The notion of a process, and with it the possibilities for process intercommunication, are fundamental in modern
operating system design and, in disguised form, in proposals for languages which admit asynchrony. A
straightforward repertoire of process control and process intercommunication primitives is proposed and
illustrated below. These primitives are interrupt-based. The general approach is founded upon the work of Brinch
Hansen [B14], Walden [W1], and Bernstein et al. [B7].

To  begin,  we  shall  elaborate  on  the  notion  of  a  process  and  sketch  a  process-processor  relation  to  be  used  as  a 
model.  We  note  briefly  that  other  operating  system  issues  must  be  kept  in  mind.  The  various  primitives  are  then 
described  in  detail,  and  following  this,  they  are  illustrated  and  compared  with  other  proposals. 

[Koch  82]  Koch,  A.  and  T.  S.  E.  Maibaum. 

A  Message  Oriented  Language  for  System  Applications. 

In  Proceedings,  Third  International  Conference  on  Distributed  Computing 
Systems,  IEEE,  ??,  1982. 

[draft,  maybe  not  accepted]. 

Abstract 

The  report  outlines  the  design  of  an  architecture  independent  programming  language  which  takes  advantage  of 
the  features  of  distributed  computer  architectures.  To  reflect  the  acceptance  of  the  use  of  abstract  data  types  in 
both  the  programming  process  and  in  language  design,  the  language  incorporates  a  mechanism  for  their 
implementation.  This  construct  allows  a  programmer  to  write  programs  which  use  the  operations  of  the  type  in 
parallel  to  any  degree  supported  by  the  abstract  properties  of  the  type.  The  language  also  incorporates  a 
mechanism for the "active" components of programs with the programmer being encouraged to regard this
construct  as  a  collection  of  functions  (as  opposed  to  the  collection  of  operations  for  a  data  type).  Powerful 
message  passing  mechanisms  are  incorporated  into  the  language  to  provide  a  strictly  typed,  asynchronous 
mechanism  for  communication.  Although  we  do  not  outline  the  ideas  here,  the  language  is  supported  by  powerful 
design  and  analysis  techniques. 

[Kramer 81] Kramer, J., H. Magee and M. Sloman.

Intertask  Communication  Primitives  for  Distributed  Computer  Control  Systems. 

In Proceedings, Second International Conference on Distributed Computing
Systems,  pages  404-411.  IEEE,  April,  1981. 

Abstract 

This  paper  concentrates  on  the  study  of  intertask  communication  primitives  suitable  for  a  distributed  process 
control  environment.  The  communication  requirements  are  identified  in  terms  of  process  control  applications. 
The  requirements  for  task  behaviour,  robustness  and  response  time  are  described  with  respect  to  these 
transactions. Existing proposals for communication primitives are examined and found to be wanting. Finally, a
set  of  primitives  are  proposed  which  match  the  requirements  more  satisfactorily  than  existing  proposals. 

[Lantz  80]  Lantz,  Keith  A. 

RIG,  An  Architecture  for  Distributed  Systems. 

In Proceedings, Pacific '80, ACM, November, 1980.

Abstract

At  the  University  of  Rochester  we  have  had  six  years  of  experience  in  the  design  and  implementation  of  a 
multiple-machine,  multiple-network  distributed  system  called  RIG.  RIG  was  built  to  serve  as  the  sole  intermediary 
between  the  human  user  (working  through  a  display  terminal  or  personal  computer)  and  his  available  computer 
facilities. As far as possible, RIG attempts to present a coherent view of the distributed system similar to that
provided  by  a  traditional  operating  system  for  a  single  computer.  The  design  of  RIG  is  based  on  a  model  of 
distributed  computation  -  independent  processes  communicating  only  by  messages  -  which  allows  programmers 
to  ignore  the  details  of  network  and  system  configuration.  The  RIG  Virtual  Terminal  Management  System, 
together with a consistent command interaction discipline, allows the end-user to engage in multiple
simultaneous activities and isolates him from the idiosyncrasies of each individual activity. This paper presents an
overview of RIG, discusses some of its major successes, and suggests avenues for future research.

[Lauer  79]  Lauer,  Hugh  C.  and  Roger  M.  Needham. 

On  the  Duality  of  Operating  System  Structures. 

SIGOPS-Operating  Systems  Review  13(2):3-19,  April,  1979. 

Abstract 

Many  operating  system  designs  can  be  placed  into  one  of  two  very  rough  categories,  depending  upon  how  they 
implement and use the notions of process and synchronization. One category, the "Message-Oriented System," is
characterized  by  a  relatively  small,  static  number  of  processes  with  an  explicit  message  system  for  communicating 
among  them.  The  other  category,  the  "Procedure-Oriented  System,"  is  characterized  by  a  large,  rapidly  changing 
number  of  small  processes  and  a  process  synchronization  mechanism  based  on  shared  data. 

In  this  paper,  it  is  demonstrated  that  these  two  categories  are  duals  of  each  other  and  that  a  system  which  is 
constructed according to one model has a direct counterpart in the other. The principal conclusion is that neither
model is inherently preferable, and the main consideration for choosing between them is the nature of the machine
architecture upon which the system is being built, not the application which the system will ultimately support.

[Le  Lann  77]  Le  Lann,  Gerard. 

Distributed Systems--Towards a Formal Approach.

In  Information  Processing  77,  IFIP,  1977. 

Abstract 

Packet-switching  computer  communication  networks  are  examples  of  distributed  systems.  With  the  large  scale 
emergence  of  mini  and  micro-computers,  it  is  now  possible  to  design  special  or  general  purpose  distributed 
systems.  However,  as  new  problems  have  to  be  solved,  new  techniques  and  algorithms  must  be  devised  to 
operate  such  distributed  systems  in  a  satisfactory  manner.  In  this  paper,  basic  characteristics  of  distributed 
systems are analyzed and fundamental principles and definitions are given. It is shown that distributed systems
are  not  just  simple  extensions  of  monolithic  systems.  Distributed  control  techniques  used  in  some  planned  or 
existing  systems  are  presented.  Finally,  a  formal  approach  to  these  problems  is  illustrated  by  the  study  of  a  mutual 
exclusion scheme intended for a distributed environment.

[Liskov 79] Liskov, Barbara.

Primitives  for  Distributed  Computing. 

In  Proceedings,  Seventh  Symposium  on  Operating  Systems  Principles,  pages 
33-42.  ACM,  December,  1979. 

Abstract 

Distributed  programs  that  run  on  nodes  of  a  network  are  now  technologically  feasible,  and  are  well-suited  to  the 
needs  of  organizations.  However,  our  knowledge  about  how  to  construct  such  programs  is  limited.  This  paper 
discusses primitives that support the construction of distributed programs. Attention is focused on primitives in
two major areas: modularity and communication. The issues underlying the selection of the primitives are
discussed,  especially  the  issue  of  providing  robust  behavior,  and  various  candidates  are  analyzed.  The  primitives 
will  ultimately  be  provided  as  part  of  a  programming  language  that  will  be  used  to  experiment  with  construction  of 
distributed  programs. 

[Liskov 81] Liskov, Barbara and Robert Scheifler.

Guardians  and  Actions:  Linguistic  Support  for  Robust,  Distributed  Programs. 
Technical  Report  210,  Massachusetts  Institute  of  Technology,  November,  1981. 
Computation  Structures  Group  Memo. 

Abstract 

This  paper  presents  an  overview  of  an  integrated  programming  language  and  system  designed  to  support  the 
construction and maintenance of distributed programs: programs in which modules reside and execute at
communicating, but geographically distinct, nodes. The language is intended to support a class of applications in
which the manipulation and preservation of long-lived, on-line, distributed data is important. The language
addresses  the  writing  of  robust  programs  that  survive  hardware  failures  without  loss  of  distributed  information  and 
that  provide  highly  concurrent  access  to  that  information  while  preserving  its  consistency.  Several  new  linguistic 
constructs  are  provided;  among  them  are  atomic  actions,  and  modules  called  guardians  that  survive  node  failures. 

[Liu  77]  Liu,  Ming  T.  and  Cecil  C.  Reames. 

Message  Communication  Protocol  and  Operating  System  Design  for  the 
Distributed  Loop  Computer  Network  (DLCN). 

In Proceedings, Fourth Annual Symposium on Computer Architecture, March,
1977.

Abstract 

The  Distributed  Loop  Computer  Network  (DLCN)  is  envisioned  as  a  powerful,  unified  distributed  computing  system 
which interconnects midi/mini/micro-computers, terminals and other peripherals through careful integration of
hardware,  software  and  a  loop  communication  network.  Research  concerning  DLCN  has  concentrated  on  the 
loop  communication  network,  message  protocol  and  distributed  network  operating  system.  For  the  loop 
communication network, previous papers [2,3] reported a novel message transmission mechanism, its hardware
implementation,  and  its  superior  performance  verified  by  GPSS  simulation.  This  paper  presents  an  overview  of  the 
design  requirements  and  implementation  techniques  for  DLCN's  message  protocol  and  network  operating  system. 
Firstly,  a  bit-oriented  distributed  message  communication  protocol  (DLMCP)  which  handles  four  message  types 
under  one  common  format  is  proposed.  Besides  user  information  transfer,  this  protocol  supports  automatic 
hardware-generated message acknowledgment, error detection and recovery, and network control and distributed
operating  system  functions.  Secondly,  the  network  operating  system  (DLOS)  is  described  which  provides  facilities 
for  interprocess  communication  by  process  name,  global  process  control  and  calling  of  remote  programs, 
generalized  data  transfer,  alterable  multi-linked  process  control  structures,  distributed  resource  management,  and 
logical  I/O  transmission  in  a  distributed  file  system. 

[Liu  81]  Liu,  Ming  T.,  Duen-Ping  Tsay,  Chuen-Pu  Chou  and  Chun-Ming  Li. 

Design of the Distributed Double-Loop Computer Network (DDLCN).

Journal of Digital Systems 4(4), March, 1981.

Abstract 

This paper presents the system design of the Distributed Double-Loop Computer Network (DDLCN), which is a
fault-tolerant distributed processing system that interconnects midi, mini, and micro computers using a double-loop
structure. Several new features and innovative concepts have been integrated into the hardware,
communications, software, and applications of DDLCN. The interface design is unique in that it employs tri-state
control logic and bit-sliced processing, thereby enabling the network to become dynamically reconfigurable and
fault-tolerant with respect to communication link failure as well as component failure in the interface. Three
classes  of  N-process  communication  protocols,  each  providing  a  different  degree  of  reliability,  have  been 
developed  for  exchanging  multi-destination  messages.  Two  synchronization  mechanisms,  eventcounts  and 
sequencers  (low-level)  and  control  abstraction  (high-level),  are  provided  for  use  in  distributed  process 
synchronization.  A  new  concurrency  control  mechanism,  which  uses  distributed  control  without  global  locking 
and  is  deadlock-free,  has  been  developed  for  use  in  distributed  database  systems.  Finally,  a  distributed 
programming language called DISLANG has been proposed for use in implementing distributed systems software.
The language uses a new concept, called Communicating Distributed Processes (CDP), to provide programmers
with  capabilities  to  handle  specific  problems  in  distributed  computing  environments,  such  as  global  operations, 
communication  delay  and  failure,  N-process  communication,  etc. 

[Livesey 79] Livesey, Jon.

Inter-Process  Communication  and  Naming  in  the  Mininet  System. 

In  Digest  of  Papers,  Eighteenth  IEEE  Computer  Society  International 
Conference,  pages  222-229.  IEEE,  February,  1979. 

Abstract 

We present a distributed message switched operating system, Mininet, in which inter-process communication is
separated  from  object  naming  and  protection. 

All objects in the system are abstracted as executable objects, tasks, and inter-process communication is carried
out between tasks without implying any particular method of object protection or naming. Naming and protection
policies and mechanisms are implemented above the interprocess communication, and can be changed without
changing it. In order to substantiate this, we present a particular model of resource naming and protection which
seems to fulfill the need to distribute resource access across the system, avoiding the need for centralized system
control. 

[Lorin  80]  Lorin,  Harold  and  Barry  C.  Goldstein. 

Operating  System  Structures  for  Polymorphic  Hardware. 

Technical  Report  35518,  IBM,  March,  1980. 

Abstract 

Given the technology that faces us, it is not unreasonable to project the existence of multiple processor
configurations  which  have  large  numbers  of  processors  with  a  variety  of  memory  sharing  and  functional  allocation 
possibilities. 

This  paper  addresses  some  problems  in  the  structure  of  operating  systems  that  will  manage  such  configurations 
so  as  to  minimize  the  systems  interference  with  application  progress  and  provide  for  effective  reaction  to  changing 
demands on the system. The structure of process creation and inter-process communication is explored in the
context of two possible software/hardware structures.

[Manning  75]  Manning,  Eric  and  Richard  W.  Peebles. 

Segment  Transfer  Protocols  for  a  Homogeneous  Computer  Network. 

In Proceedings, SIGCOMM-SIGOPS Interface Workshop on Interprocess
Communications,  pages  170-178.  ACM,  March,  1975. 

Abstract 

This  research  is  focussed  on  solving  certain  problems  of  distributed  processing  on  a  distributed  data  base,  with 
emphasis on transaction processing. Many data bases exhibit geographic locality of reference: most of the
transactions homing on a given component of the data base originate from a particular geographic region. At the
same time there is a need to operate the collection of components as a single data base, to provide for occasional
transactions which cross regional boundaries, and for managerial queries and information retrieval applications
which  span  the  entire  data  base.  There  are  many  examples  of  this  associated  with  business  and  industry;  credit 
and  inventory  records  for  example.  Finally,  geographic  locality  of  reference  is  only  one  of  the  reasons  for  creating 
logically  unified  but  physically  distributed  data  bases.  If  a  data  base  contains  information  supplied  by  several 
agencies, each may insist as a matter of policy that 'its' data be held in 'its' hardware located on 'its' premises,
quite  apart  from  the  technical  efficiencies  which  may  accrue. 

[Manning  77]  Manning,  Eric  G.  and  Richard  W.  Peebles. 

A  Homogeneous  Network  for  Data-Sharing  Communications. 

Computer Networks 1(2):211-224, 1977.

Abstract 

The  communications  aspects  of  a  distributed  architecture  for  transaction  processing  are  described.  The 
architecture  is  aimed  at  transaction  processing  on  physically  distributed  data  bases,  where  most  of  the  hits  on  a 
given  component  of  the  data  base  come  from  a  single  geographic  region.  The  architecture  is  physically  based  on 
a homogeneous set of host minicomputers, a message-switched communications subnetwork (loop or
packet-switched), and a set of network interface processors which connect the hosts to the communications
subnetwork. It is logically based on two primitives: all data objects (including messages) are segments and all
control objects (including messages) are tasks. Each
task runs in a private virtual space and all inter-task communication is done by passing message segments.
Segment passing is done by a single message-switching task in each host, assisted by the interface processors
and  communications  subnetwork  where  necessary.  The  message-switching  task  also  enforces  protection  rules 
without  the  need  for  special  hardware. 

A two-host implementation of the logical architecture is operational. It is based on PDP-11 minicomputers and a
non-switched wire pair subnetwork. The companion paper describes modelling studies of the architecture, using
simulation  and  queueing-theoretic  techniques. 

[Manning 80] Manning, Eric, Jon Livesey and H. Tokuda.

Interprocess  Communication  in  Distributed  Systems:  One  View. 

In Proceedings, IFIP Congress 80, pages 513-520. IFIP, October, 1980.

Abstract 

This  paper  first  describes  the  program  of  experimental  research  in  distributed  systems  which  has  been  carried  out 
in the Computer Communications Networks Group of the University of Waterloo, over the past six years. The focus
of the paper is on inter-process communication (IPC) techniques, and we therefore provide a comparison of
message-switched IPC facilities in several distributed systems developed both at Waterloo and elsewhere. The
points of comparison include message management, synchronization modes, and performance. We have almost
invariably chosen message-switched IPC for our distributed systems, and we examine the reasons for these
decisions.  Finally,  we  draw  a  few  conclusions. 

[Mao  80]  Mao,  T.  William  and  Raymond  T.  Yeh. 

Communication Port: A Language Concept for Concurrent Programming.

Transactions on Software Engineering SE-6(2):194-204, March, 1980.

Abstract 

A new language concept, communication port (CP), is introduced for programming on distributed processor
networks. Such a network can contain an arbitrary number of processors each with its own private storage but
with no memory sharing. The processors must communicate via explicit message passing. Communication port is
an encapsulation of two language properties: "communication nondeterminism" and "communication disconnect
time." It provides a tool for programmers to write well-structured, modular, and efficient concurrent programs. A
number  of  examples  are  given  in  the  paper  to  demonstrate  the  power  of  the  new  concepts. 

[Metcalfe  72]  Metcalfe,  Robert  M. 

Strategies  for  Interprocess  Communication  in  a  Distributed  Computing  System. 

In  Proceedings,  Symposium  on  Computer-Communication  Networks  and 
Teletraffic,  Polytechnic  Institute  of  Brooklyn,  April,  1972. 

Abstract 

A  recurring  problem  in  the  development  of  the  ARPA  Computer  Network  (ARPANET)  is  that  of  organizing  the 
coordination  of  remote  processes.  ARPANET  experience  leads  us  to  suggest  that  there  are  valuable  distinctions 
to be made between: (1) distributed interprocess communication as required in computer networks; and (2)
centralized  interprocess  communication  as  often  employed  within  computer  operating  systems.  On  the  basis  of  a 
preliminary  conceptualization,  we  propose  that  good  strategies  for  distributed  interprocess  communication  should 
be  used  more  generally  in  computer  operating  systems  because:  (1)  they  have  a  clarifying  effect  on  the 
management of multiprocess activity; and (2) they generalize well as operating systems themselves become more
distributed. 

[Miller 81] Miller, Barton and David Presotto.

XOS:  An  Operating  System  for  the  X-TREE  Architecture. 

SIGOPS-Operating  Systems  Review  15(2):21-32,  April,  1981. 

Abstract 

This  paper  describes  the  fundamentals  of  the  X-TREE  Operating  System  (XOS),  a  system  developed  to  investigate 
the  effects  of  the  X-TREE  architecture  on  operating  system  design.  It  outlines  the  goals  and  constraints  of  the 
project  and  describes  the  major  features  and  modules  of  XOS.  Two  concepts  are  of  special  interest.  The  first  is 
demand  paging  across  the  network  of  nodes  and  the  second  is  separation  of  the  global  object  space  and  the 
directory structure used to reference it. Weaknesses in the model are discussed along with directions for future
research. 

[Mills 75] Mills, David L.

The  Basic  Operating  System  for  the  Distributed  Computer  Network. 

Technical  Report  416,  University  of  Maryland,  October,  1975. 

Abstract 

This report describes the Basic Operating System (BOS) for the Distributed Computer Network (DCN). The BOS is
a multiprogramming executive providing process and storage management, interprocess communications,
input/output device control and application-program support. It operates with any PDP11 model including at least
4K  of  storage,  an  operator's  console  and  a  communication  device  for  connection  to  the  DCN. 

Included  in  this  report  is  a  description  of  the  various  components  that  make  up  the  BOS  and  the  manner  in  which 
they  operate.  Also  described  are  the  various  primitive  functions  and  command  operations  used  to  control  the 
operation  of  the  network  and  the  various  application  programs.  Other  reports,  listed  in  the  references,  describe 
the  functioning  of  the  DCN  as  a  whole  and  also  the  upwardly-compatible  Virtual  Operating  System  (VOS) 
developed for PDP11 models with memory management features.

[Mills  76]  Mills,  David  L. 

An  Overview  of  the  Distributed  Computer  Network. 

In Proceedings, National Computer Conference, pages 523-531. AFIPS, 1976.

Abstract

The Distributed Computer Network (DCN) is a resource-sharing computer network which includes a number
of DEC PDP11 computers. The DCN supports a number of processes in a multiprogrammed virtual environment.
Processes  can  communicate  with  each  other  and  interface  with  this  environment  in  a  manner  which  is 
independent  of  their  residence  within  a  particular  computer.  Resources  such  as  processors,  devices  and  storage 
media  can  be  remotely  accessed  and  shared  so  as  to  provide  increased  reliability,  flexibility  and  system  utilization. 

The  DCN  now  supports  several  programming  languages  and  application  packages.  Programming  languages  such 
as SIMPL, LISP, BASIC and others, along with an extensive library of interactive graphics procedures, can be
executed in processes which take full advantage of the distributed architecture of the network. Many of the
components of the Disk Operating System (DOS) for the PDP11 can be executed in a special emulator-type virtual
process now being constructed for this purpose. In this manner the PDP11 assembler, FORTRAN compiler and
various  system  utilities  can  be  supported  in  the  network  environment.  In  cases  which  exceed  the  processing 
power  of  the  network,  connections  are  available  to  two  large  Univac  1100-series  machines. 

[Morling 78] Morling, R. C. S., G. Neri, G. D. Cain, E. Faldella, T. Salmon, D. J. Stedham.

The  MININET  Inter-Node  Control  Protocol. 

In Proceedings, Computer Network Protocols, pages B4-1 - B4-6. Université de
Liège, February, 1978.

Abstract 

MININET  is  a  packet-switching  data  transportation  network  being  developed  as  a  solution  to  the  problem  of  local 
area  data  networking,  with  particular  emphasis  on  low-cost  interconnections  for  instrumentation  environments. 
This  paper  describes  a  protocol  for  the  interchange  of  messages  that  relate  to  the  internal  operation  of  the 
network,  and  which  must  be  blended  unobtrusively  into  the  packet  streams  being  transported  between  the  users  of 
the  network.  Details  of  the  differences  between  user  and  network  packet  structures  and  handling  are  presented 
and  it  is  shown  that  adoption  of  a  half  duplex  version  of  a  protocol  developed  earlier,  the  MININET  Link  Protocol, 
preserves the essential simplicity of that protocol and satisfies the requirements for internal network
conversations.

[Morris  72]  Morris,  D.,  G.  R.  Frank  and  T.  J.  Sweeney. 

Communications  in  a  Multi-Computer  System. 

In Proceedings, Conference on Computers-Systems and Technology, pages
405-414.  The  Institution  of  Electronic  and  Radio  Engineers,  October,  1972. 

Abstract 

The MU5 system being constructed at the University of Manchester consists of several computers connected
together so that they may access each other's stores. The operating system for the complex is sub-divided into
about 16 separate programs which run independently except for communicating with each other via a formalised
message-switching system. These programs are distributed across the machines of the complex; hence the
message-switching systems of the separate machines are linked. Within one machine messages are transferred by
passing pointers to page tables rather than by copying the information. Transfers between machines involve copying
pages as necessary.

[Nelson  80]  Nelson,  Bruce  Jay. 

Remote  Procedure  Call. 

PhD  Thesis  Proposal,  Department  of  Computer  Science,  Carnegie-Mellon 
University. 

Abstract 

Remote  procedure  call  is  the  transfer  of  control  between  programs  in  disjoint  address  spaces  which  share  no 
resources  except  a  narrow  communication  medium. 

This  proposal  first  establishes  a  perspective  for  the  work  and  goals  of  the  thesis.  We  then  define  remote 
procedures precisely, outline the important issues with an example, and characterize the benefits. We survey
past work on remote procedures, briefly commenting on its relationship to message-passing systems, and
examine  some  existing  implementations.  These  efforts  are  shown  to  be  weak  when  measured  against  a  spectrum 
of  important  remote  procedure  issues:  strong  typechecking,  parameter  functionality,  binding  and  configuration, 
exactly-once  semantics,  and  error  handling  and  crash  recovery.  We  discuss  these  issues  in  detail  and  propose 
those  which  the  thesis  will  investigate  in  depth.  Some  preliminary  results  on  configuration  are  given. 


[Nelson 81] Nelson, Bruce Jay.

Remote  Procedure  Call. 

PhD thesis, Department of Computer Science, Carnegie-Mellon University, May,
1981. 

Abstract 

Remote procedure call is the synchronous language-level transfer of control between programs in disjoint address
spaces  whose  primary  communication  medium  is  a  narrow  channel.  The  thesis  of  this  dissertation  is  that  remote 
procedure  call  (RPC)  is  a  satisfactory  and  efficient  programming  language  primitive  for  constructing  distributed 
systems. 

A survey of existing remote procedure mechanisms shows that past RPC efforts are weak in addressing the five
crucial  issues:  uniform  call  semantics,  binding  and  configuration,  strong  typechecking,  parameter  functionality, 
and  concurrency  and  exception  control.  The  body  of  the  dissertation  elaborates  these  issues  and  defines  a  set  of 
corresponding  essential  properties  for  RPC  mechanisms.  These  properties  must  be  satisfied  by  any  RPC 
mechanism that is fully and uniformly integrated into a programming language for a homogeneous distributed
system. Uniform integration is necessary to meet the dissertation's fundamental goal of syntactic and semantic
transparency for local and remote procedures. Transparency is important so that programmers need not concern
themselves  with  the  physical  distribution  of  their  programs. 

In addition to these essential language properties, a number of pleasant properties are introduced that ease the
work  of  distributed  programming.  These  pleasant  properties  are  good  performance,  sound  remote  interface 
design,  atomic  transactions,  respect  for  autonomy,  type  translation,  and  remote  debugging. 

With the essential and pleasant properties broadly explored, the detailed design of an RPC mechanism that satisfies
all of the essential properties and the performance property is presented. Two design approaches are used: The
first assumes full programming language support and involves changes to the language's compiler and binder.
The second involves no language changes, but uses a separate translator, a source-to-source RPC compiler, to
implement  the  same  functionality. 

[Ousterhout  79]  Ousterhout,  John  K.,  Donald  A.  Scelza  and  Pradeep  S.  Sindhu. 

Medusa: An Experiment in Distributed Operating System Structure (Summary).

In  Proceedings,  Seventh  Symposium  on  Operating  Systems  Principles,  pages 
115-116. ACM, December, 1979.

Abstract 

The paper is a discussion of the issues that arose in the design of an operating system for a distributed
multiprocessor, Cm*. Medusa is an attempt to understand the effect on operating system structure of distributed
hardware,  and  to  produce  a  system  that  capitalizes  on  and  reflects  the  underlying  architecture.  The  resulting 
system  combines  several  structural  features  that  make  it  unique  among  existing  operating  systems. 

[Ousterhout  80a]  Ousterhout,  John  K. 

Partitioning  and  Cooperation  in  a  Distributed  Multiprocessor  Operating  System: 
Medusa. 

PhD thesis, Department of Computer Science, Carnegie-Mellon University, April,
1980. 

Abstract 

This  dissertation  is  an  analysis  of  the  design  of  Medusa,  an  operating  system  with  a  highly  distributed  control 
structure that runs on the Cm* multimicroprocessor. In order to gain an understanding of how to exploit
distributed hardware, the system's structure was allowed to derive directly from the constraints of the underlying
machine. The Cm* hardware is distributed, yet extremely flexible in the kinds of interprocessor communication it
permits. Thus Medusa's structure arose from a consideration of two issues: partitioning and cooperation. How
should the system be partitioned in order to enhance its modularity and make use of the distributed hardware? How
should the separate subunits communicate so as to function together in a robust way as a single logical entity? The
resulting  system  combines  several  structural  features  that  make  it  unique  among  existing  operating  systems. 

In  order  to  provide  modularity  and  to  capitalize  on  the  distributed  hardware,  Medusa  consists  of  five  relatively 
independent utilities that execute on different processors. Each utility provides one abstraction for the rest of the
system and communicates with user programs and other utilities via messages. Functions are distributed
between utilities at a very low level (for example, no one utility contains enough functionality to create and execute
a new program without assistance from other utilities). The message communication mechanism plays a central
role in the system; it is discussed in detail and compared to other existing or proposed mechanisms. The
distribution of the utilities presents a deadlock danger. It is shown how a coroutine-based utility structure avoids
deadlock. 

[Ousterhout  80b]  Ousterhout,  John  K.,  Donald  A.  Scelza  and  Pradeep  S.  Sindhu. 

Medusa:  An  Experiment  in  Distributed  Operating  System  Structure. 

Communications of the ACM 23(2):92-105, February, 1980.

Abstract 

The design of Medusa, a distributed operating system for the Cm* multimicroprocessor, is discussed. The Cm*
architecture  combines  distribution  and  sharing  in  a  way  that  strongly  impacts  the  organization  of  operating 
systems.  Medusa  is  an  attempt  to  capitalize  on  the  architectural  features  to  produce  a  system  that  is  modular, 
robust,  and  efficient.  To  provide  modularity  and  to  make  effective  use  of  the  distributed  hardware,  the  operating 
system  is  partitioned  into  several  disjoint  utilities  that  communicate  with  each  other  via  messages.  To  take 
advantage  of  the  parallelism  present  in  Cm*  and  to  provide  robustness,  all  programs,  including  the  utilities,  are 
task  forces  containing  many  concurrent,  cooperating  activities. 

[Panzieri 82] Panzieri, F. and S. K. Shrivastava.

Reliable  Remote  Calls  for  Distributed  UNIX:  An  Implementation  Study. 

In  Proceedings,  Second  Symposium  on  Reliability  in  Distributed  Software  and 
Database  Systems,  pages  127-133.  July,  1982. 

Abstract 

An implementation of a reliable remote procedure call mechanism for obtaining remote services is described. The
reliability issues are discussed together with how they have been dealt with. The performance of the remote call
mechanism  is  compared  with  that  of  local  calls.  The  remote  call  mechanism  is  shown  to  be  an  efficient  tool  for 
distributed  programming. 

[Panzieri  ??]  Panzieri,  F.  and  S.  K.  Shrivastava. 

The  Design  of  a  Reliable  Remote  Procedure  Call  Mechanism. 

[The University of Newcastle upon Tyne - Computing Laboratory].

Abstract 

Starting from the hardware level that provides primitive facilities for data transmission, we describe how a reliable
Remote Procedure Call mechanism can be constructed. We discuss various design issues involved; these include
the  choice  of  a  message  passing  system  over  which  the  remote  procedure  call  mechanism  is  to  be  constructed 
and  the  treatment  of  various  abnormal  situations  such  as  lost  messages  and  node  crashes. 

[Pardo  78]  Pardo,  Roberto,  Ming  T.  Liu  and  Gojko  A.  Babic. 

An  N-Process  Communication  Protocol  for  Distributed  Processing. 

In  Proceedings,  Symposium  on  Computer  Network  Protocols,  pages  13-15.  IEEE, 
February,  1978. 

Abstract 

A  Distributed  Processing  Algorithm  (DPA)  is  an  algorithm  whose  execution  involves  interaction  between  two  or 
more  remote  processes  in  a  distributed  processing  system.  Most  of  the  software  issues  in  distributed  processing 
systems  are  related  to  the  concept  of  DPA's.  One  important  aspect  is  the  message  exchange  (protocol) 
requirements  induced  by  the  DPA's.  Current  high-level  communication  protocols  efficiently  support  the 
establishment, maintenance, and termination of connections between two processes, and thus can be called
2-process communication protocols. However, this class of protocols limits the type of DPA's that can be
efficiently supported by a distributed processing system. In this paper we propose a class of protocols that are not
constrained to handle only 2-process communication but rather any "network of connections," and we refer to a
protocol in this class as an n-process communication protocol. The purpose of this paper is to motivate the need
for  such  protocols,  to  show  their  relationship  with  distributed  processing  systems,  and  to  establish  their  features. 

[Peberdy  ??]  Peberdy,  N.  J. 

Distributed Computer Systems - A Model.

[unknown, pages 17-25].

Abstract 

The  past  five  years  have  seen  a  dramatic  changeabout  in  traditional  hardware/software  relationships:  hardware 
costs  have  plummeted,  and  the  size,  environmental  requirements  and  reliability  of  computing  elements  have 
altered  drastically.  It  now  becomes  feasible  to  distribute  a  computing  system,  such  that  processors  may  be  placed 
adjacent to the processes they control. These distributed computing modules operate in an essentially parallel
mode, but are required to communicate in order to co-ordinate their activities. Reliable, secure communication
systems must be established to ensure correct operation. Such systems are not only functions of the electrical
hardware employed, but also of the software support provided. Of vital importance are the protocols selected,
which define and detail an agreed procedure for the exchange of information.

This  paper  reviews  the  fundamental  software  considerations  in  the  design  of  computer  networks,  with  specific 
relevance for process-control applications. It discusses in detail inter-connection strategies and protocols and
briefly  examines  currently  adopted  schemes.  The  implications  of  fully  decentralized  system  control  are 
considered.  Of  particular  concern  is  the  question  of  the  production  of  reliable,  fault-tolerant,  secure  systems. 

[Peebles  78]  Peebles,  Richard  and  Eric  Manning. 

System  Architecture  for  Distributed  Data  Management. 

Computer  11(1):40-47,  January,  1978. 

Abstract 

Successful implementation of most distributed processing systems hinges on solutions to the problems of data
management, some of which arise directly from the nature of distributed architecture, while others carry over from
centralized systems, acquiring new importance in their broadened environment. Numerous solutions have been
proposed  for  the  most  important  of  these  problems. 

In  a  distributed  computer  system,  multiple  computers  are  logically  and  physically  interconnected  over  "thin-wire" 
(low  bandwidth)  channels  and  cooperate  under  decentralized  system-wide  control  to  execute  application 
programs.  Examples  of  thin-wire  systems  are  Arpanet,  the  packet-switched  network  of  the  U.S.  Defense 
Communications  Agency,  and  Mininet,  a  transaction-oriented  research  network  being  developed  at  the  University 
of  Waterloo.  These  may  be  contrasted  with  high-bandwidth  or  "thick-wire"  multiprocessor  architectures,  such  as 
the Honeywell 6080 or the Pluribus IMP. A practical consequence of thin-wire design is that processing control is
in  multiple  centers.  No  one  processor  can  coordinate  the  others;  all  must  cooperate  in  harmony  as  a  community  of 
equals. 

The key issue is that interprocess communication is at least an order of magnitude slower when the
communicating  tasks  are  in  separate  computers  than  it  is  when  they  are  executing  in  the  same  machine. 
Therefore,  no  single  process  can  learn  the  global  state  of  the  entire  system  nor  issue  control  commands  quickly 
enough  for  efficient  operation,  so  that  multiple  centers  of  control  are  implied. 

[Pouzin  73]  Pouzin,  Louis. 

Presentation and Major Design Aspects of the Cyclades Computer Network.

In Proceedings, Third Data Communications Symposium, pages 80-87. IEEE and
ACM,  November,  1973. 

Abstract 

A  computer  network  is  being  developed  in  France,  under  government  sponsorship,  to  link  about  twenty 
heterogeneous computers located in universities, research and D.P. Centers. Goals are to set up a prototype
network  in  order  to  foster  experiment  in  various  areas,  such  as:  data  communications,  computer  interaction, 
cooperative  research,  distributed  data  bases.  The  network  is  intended  to  be  both,  an  object  for  research,  and  an 
operational  tool. 

In  order  to  speed  up  the  implementation,  standard  equipment  is  used,  and  modifications  to  operating  systems  are 
minimized.  Rather,  the  design  effort  bears  on  a  carefully  layered  architecture,  allowing  for  a  gradual  insertion  of 
specialized  protocols  and  services  tailored  to  specific  application  and  user  classes. 

[Pouzin  75a]  Pouzin,  Louis. 

Virtual  Call  Issues  in  Network  Architectures. 

Technical Report SCH 559.1, Institut de Recherche d'Informatique et
d'Automatique (IRIA), September, 1975.

Abstract 

The  concept  of  virtual  circuit  is  mainly  used  to  designate  a  set  of  end-to-end  control  mechanisms  in  packet 
switching  networks.  Similar  mechanisms  called  liaisons  may  be  found  at  higher  levels  in  a  computer  network. 
Their  properties  are  reviewed,  specifically  with  regard  to  port  access,  error  and  flow  control.  Various  forms  of 
virtual  circuits  are  included  in  existing  or  planned  packet  networks.  But  some  networks  have  none.  Since 
end-to-end  control  mechanisms  always  exist  at  higher  levels,  it  is  not  clear  that  virtual  circuits  in  packet  networks 
are worth their cost.

Interfacing computer systems with virtual circuits raises a number of problems specifically in splicing with liaisons
at a higher level. Another approach is a gateway mimicking terminals. Finally, the least interfering approach is to
consider  virtual  circuits  as  a  substitute  to  real  ones. 

[Pouzin 75b] Pouzin, Louis.

An  Integrated  Approach  to  Network  Protocols. 

Technical Report NCP 500.1, Institut de Recherche d'Informatique et
d'Automatique (IRIA), May, 1975.

Abstract 

Host-to-host  protocols  (H-H)  for  heterogeneous  computer  networks  are  still  in  infancy.  So  far  very  few 
implementations  are  in  existence.  Among  those  on  which  documentation  is  available  are  Arpanet  and  Cyclades. 
The former provides only for basic services allowing the transfer of up to 1000 octet messages, with flow control
but not error control. The latter allows up to 32000 octet messages, with error and flow control. Both are similar in
the sense that they offer only a message transfer service, which is intended for building higher level protocols
more appropriate for specific uses. Since data to be transferred are usually structured in various ways, a traditional
approach is to superimpose additional layers of specific protocols, each one dealing with a particular level of
structure. While being functionally correct, this approach leads to heterogeneity, redundancy and overhead
among  the  various  layers. 

[Pouzin  76]  Pouzin,  Louis. 

Virtual  circuits  vs.  datagrams  -  Technical  and  political  problems. 

Technical Report SCH 576.1, Institut de Recherche d'Informatique et
d'Automatique (IRIA), June, 1976.

Abstract 

Public packet networks are becoming a reality, and call for interface standards. Two levels of facilities have been
proposed,  virtual  circuit  (VC),  and  datagram  (DG).  The  concepts  of  VC  and  DG  are  already  well  developed  within 
computer  networks.  Their  properties  are  reviewed,  along  with  typical  issues  such  as  out-of-sequence  and 
congestion  problems. 

Usually DG's are a sub-layer used as a transport facility by a VC protocol. They also provide the ability to extend
switching functions within user systems. The characteristics of VC's considered by CCITT are examined critically,
and  related  to  experimental  networks  and  manufacturer  softwares. 

VC's  and  DG's  are  compared  from  the  viewpoint  of  adapting  customer  systems  to  public  networks.  When  the 
customer is interested in a transport facility, DG's appear to have an edge. When a network becomes a terminal
handler, adaptations are more complex and require character stream interfaces. Intelligent terminals would make
this  problem  disappear,  as  they  can  use  a  DG  interface. 

Although various groups call for a DG interface, the carriers are opposed to it. Four carriers are rushing a VC
protocol through CCITT. The carriers' goal is to take over terminal handling, and gradually other processing
functions.  DG's  would  leave  too  much  freedom  to  the  customer.  The  political  implications  of  the  carrier  policy 
suggest  that  better  boundaries  be  drawn  up  between  carriers  and  data  processing. 

[Proebster 78] Proebster, W. E. and V. Sadogopan.

Communication  Technology  and  Concepts:  Technical  Status  and  Outlook. 
Communication  Technology  31966,  IBM,  December,  1978. 

Abstract 

The  most  important  communication  concepts  are  described  in  a  systematic  way,  covering  the  hierarchy  of 
communication  elements,  systems  and  services. 

Current  technical  implementations  of  these  concepts  are  discussed  and  the  major  future  trends  are  highlighted. 
Special  emphasis  is  given  to  fiber  optics,  communication  satellites,  large  scale  integration  and  microprocessors. 

[Quatse 72] Quatse, Jesse T., Pierre Gaulene and Donald Dodge.

The  External  Access  Network  of  a  Modular  Computer  System. 

In  Proceedings,  Spring  Joint  Computer  Conference,  pages  118-125.  IEEE,  1972. 

Abstract 

A  modular  time-sharing  computer  system,  called  PRIME,  is  currently  under  development  at  the  University  of 
California, Berkeley. Basically, PRIME consists of sets of modules such as processors, primary memory modules,
and  disk  drives,  which  are  dynamically  reconfigured  into  separate  subsystems.  One  ramification  of  the 
architectural  approach  is  the  need  for  a  medium  to  accommodate  three  classes  of  communications:  (1)  those 
between any processor and any other processor, (2) those between any processor and any disk drive, external
computer  system,  or  other  device  in  the  facility  pool,  and  (3)  those  between  primary  memory  and  any  device  in  the 
facility  pool.  This  paper  describes  the  External  Access  Network  (EAN)  which  was  developed  for  this  purpose.  The 
EAN  is  specialized  by  certain  PRIME  implementation  constraints.  Otherwise,  it  is  adaptable  to  any  system  having 
similar design objectives, or to aggregates of independent computer systems at the same site, which share a
similar  facility  pool  and  which  require  system  to  system  communications. 

[Rao  80]  Rao,  Ram. 

Design  and  Evaluation  of  Distributed  Communication  Primitives. 

In  Proceedings,  ACM  Pacific  '80,  pages  14-23.  ACM,  November,  1980. 

Abstract 

Communication primitives suitable for use in a distributed programming environment are the focus of this paper.
The  design  of  such  primitives  involves  issues  such  as  communication  model,  synchronization,  selective  receiving, 
message  length,  naming  and  buffering.  Alternatives  for  handling  these  design  issues  are  presented  and  their 
mutual dependencies are defined. A number of existing primitives are examined in the light of these design
decisions. 

Two  benchmark  problems  are  defined  and  programs  for  them  have  been  written  using  each  set  of  primitives.  The 
programming experience is used to evaluate the primitives and draw conclusions about the design of
communication  primitives  for  various  distributed  programming  environments. 

[Rao  82]  Rao,  Ram. 

A  Kernel  for  Distributed  and  Shared  Memory  Communication. 

Technical Report 82-06-01, Department of Computer Science, University of
Washington,  June,  1982. 

Abstract 

Interprocess  communication  via  shared  memory  has  received  considerable  attention  in  the  past.  More  recently, 
there  has  been  a  growing  interest  in  communication  in  distributed  environments.  This  dissertation  examines 
distributed communication and attempts to integrate it with shared memory communication. A kernel is
presented  which  provides  simple  tools  to  facilitate  communication  in  these  environments,  and  allows  definition  of 
new  communication  mechanisms.  The  kernel  consists  of  features  for  synchronization  and  data  transfer  and 
locking.  By  combining  the  synchronization  and  data  transfer  facilities,  distributed  communication  may  be 
modelled.  Shared  memory  communication  normally  requires  synchronization  and  locking.  Some  applications 
require both shared memory and distributed communication. Such "hybrid" applications typically use the kernel
features for synchronization, data transfer and locking. The kernel operations, though somewhat low level,
provide flexibility in designing efficient mechanisms well suited for specific applications. Several examples illustrate
the use of the kernel in programming solutions to a variety of communication problems, as well as in modelling
some  programming  language  mechanisms  (including  Ada  and  CSP). 

[Rashid  80]  Rashid,  Richard  F. 

An Inter-Process Communication Facility for UNIX.

Technical Report 124, Department of Computer Science, Carnegie-Mellon
University,  February,  1980. 

Abstract 

An inter-process communication facility implemented at Carnegie-Mellon University for VAX/UNIX version seven is
described.  This  facility  was  designed  to  provide  language,  operating  system  and  machine  independent 
communication  between  processes  performing  distributed  computations.  Its  relationships  to  previously  existing 
UNIX  facilities  and  other  systems  for  distributed  computing  are  discussed. 

[Rashid  81]  Rashid,  Richard  F.  and  George  G.  Robertson. 

Accent:  A  Communication  Oriented  Network  Operating  System  Kernel. 

Technical  Report  123,  Department  of  Computer  Science,  Carnegie-Mellon 
University,  April,  1981. 

Abstract 

Accent  is  a  communication  oriented  operating  system  kernel  being  built  at  Carnegie-Mellon  University  to  support 
the  distributed  personal  computing  project,  Spice,  and  the  development  of  a  fault-tolerant  distributed  sensor 
network (DSN). Accent is built around a single, powerful abstraction of communication between processes, with
all kernel functions, such as device access and virtual memory management, accessible through messages and
distributable throughout a network. In this paper, specific attention is given to system supplied facilities which
support transparent network access and fault-tolerant behavior. Many of these facilities are already being
provided  under  a  modified  version  of  VAX/UNIX.  The  Accent  system  itself  is  currently  being  implemented  on  the 
Three  Rivers  Corp.  PERQ. 

[Rashid 82] Rashid, Richard F.

Accent  Kernel  Interface  Manual. 

Technical Report ??, Department of Computer Science, Carnegie-Mellon University,
1982. 

Abstract 

Accent is a communication oriented operating system kernel being built at Carnegie-Mellon University to support
the distributed personal computing project, Spice, and the development of a fault-tolerant distributed sensor
network (DSN). Accent is built around a single, powerful abstraction of communication between processes, with
all kernel functions, such as device access and virtual memory management, accessible through messages and
distributable  throughout  a  network.  In  this  manual,  specific  attention  is  given  to  the  program  interface  to  the 
Accent  kernel.  Many  of  the  facilities  described  (in  particular  IPC  related  facilities)  are  already  being  provided 
under  a  modified  version  of  VAX/UNIX.  The  Accent  system  itself  is  currently  being  implemented  on  the  Three 
Rivers  Corporation  PERQ. 

[Redell 80] Redell, David D., Yogen K. Dalal, Thomas R. Horsley, Hugh C. Lauer, William
C.  Lynch,  Paul  R.  McJones,  Hal  G.  Murray  and  Stephen  C.  Purcell. 

Pilot:  An  Operating  System  for  a  Personal  Computer. 

Communications  of  the  ACM  23(2):81-92,  February,  1980. 

Abstract 

The  Pilot  operating  system  provides  a  single-user,  single-language  environment  for  higher  level  software  on  a 
powerful  personal  computer.  Its  features  include  virtual  memory,  a  large  "flat"  file  system,  streams,  network 
communication facilities, and concurrent programming support. Pilot thus provides rather more powerful facilities
than  are  normally  associated  with  personal  computers.  The  exact  facilities  provided  display  interesting  similarities 
to  and  differences  from  corresponding  facilities  provided  in  large  multi-user  systems.  Pilot  is  implemented  entirely 
in  Mesa,  a  high  level  system  programming  language.  The  modularization  of  the  implementation  displays  some 
interesting  aspects  in  terms  of  both  the  static  structure  and  dynamic  interactions  of  the  various  components. 

[Reed  80]  Reed,  David  and  Liba  Svobodova. 

SWALLOW:  A  Distributed  Data  Storage  System  for  a  Local  Network. 

In Proceedings, IFIP Working Group 6.4 International Workshop on Local Networks
for  Computer  Communications,  pages  355-373.  IBM,  August,  1980. 

Abstract 

SWALLOW is an experimental project that will test feasibility of several advanced ideas on the design of
object-oriented  distributed  systems.  Its  purpose  is  to  provide  a  reliable,  secure  and  efficient  storage  in  a 
distributed  environment  consisting  of  many  personal  machines  and  one  or  more  shared  data  storage  servers. 
SWALLOW implements a uniform interface to all objects accessible from a personal computer: these objects can
be  stored  either  on  the  local  storage  device  or  in  one  of  the  data  storage  servers.  The  data  storage  servers  provide 
stable,  reliable,  and  long-term  storage.  The  access  control  to  objects  in  the  data  storage  servers  is  based  on 
encrypting  the  data;  encryption  is  used  to  prevent  both  unauthorized  release  of  information  and  unauthorized 
modification.  SWALLOW  can  handle  efficiently  both  very  small  and  very  large  objects  and  it  provides  mechanisms 
for  updating  of  a  group  of  objects  at  one  or  more  physical  nodes  in  a  single  atomic  action. 

[Reid 80] Reid, Loretta Guarino.

Control  and  Communication  in  Programmed  Systems. 

PhD  thesis,  Department  of  Computer  Science,  Carnegie-Mellon  University, 
September,  1980. 

Abstract 

The paper "On the Duality of Operating Systems Structures" by Lauer and Needham (1978) was an extremely
controversial  paper.  It  claimed  to  have  demonstrated  an  important  result  about  communication  in  operating 
systems,  but  it  left  many  people  uneasy  and  unsure  of  exactly  what  had  been  demonstrated.  Attempts  to  formalize 
the  results  of  the  paper  by  casting  it  in  terms  of  known  models  of  systems  failed,  primarily  because  the  models 
lacked  the  ability  to  represent  the  dynamic  nature  of  systems. 

A  model  of  communication,  consisting  of  primitive  objects  and  communication  operations,  is  developed  in  this 
thesis in order to study communication properties of systems. Some of the goals of the design of the model were
to  permit  us  to  deal  with  the  dynamic  recreation  and  destruction  of  pieces  of  systems,  to  permit  the  description  of 
systems  programmed  in  a  wide  variety  of  languages  and  implemented  on  a  wide  range  of  architectures,  and  to 
provide  some  support  for  flexibility  and  the  localization  of  communication  knowledge  in  systems. 

Although  the  primitives  of  the  model  are  sufficient  to  describe  communication  in  systems,  working  with  them 
directly  is  much  like  programming  only  with  GO  TO's.  There  is  a  lot  of  structure  to  the  way  that  communications 
takes place, and this structure is not explicitly visible in the use of primitives. The thesis introduces a notation for
describing abstract communication constructs in a way that permits the structure to be expressed precisely and
that allows constructs to be compared for their similarities and differences.

The  thesis  uses  the  model  and  the  notion  of  abstract  communication  constructs  to  explore  several  issues  in 
communication.  In  an  effort  to  characterize  the  properties  of  communication  that  make  it  easy  to  use  programs  in 
many  different  systems,  the  criterion  of  the  flexibility  of  an  implementation  is  developed.  The  Lauer-Needham 
paper  is  discovered  not  to  demonstrate  the  duality  of  the  two  types  of  operating  systems  but  to  introduce  an 
abstract  communication  construct  that  is  flexible  enough  to  be  implemented  directly  in  both  systems.  Finally,  a 
proof  is  given  of  the  necessary  and  sufficient  conditions  on  communication  for  a  system  to  be  completely 
sequential. 

[Rett  75]  Rett,  David  L. 

Operating  System  Design  Considerations  for  the  Packet-Switching  Environment. 

In Proceedings, National Computer Conference, pages 155-160. AFIPS, 1975.
Volume  44. 

Abstract 

One  of  the  striking  developments  in  computing  and  communication  technology  during  the  past  decade  is  reflected 
in  the  evolution  of  packet-switching  computer  networks.  Packet-switching  communication  techniques  allow 
dynamic  allocation  of  a  set  of  communication  resources  (circuits)  so  that  they  may  be  flexibly  shared  among  a 
number  of  autonomous  processors.  Implementation  of  such  packet-switching  networks  has  required  many  design 
decisions,  such  as  the  choice  of  network  topology,  routing  strategies,  and  the  establishment  of  conventions,  or 
protocols,  for  information  interchange  between  network  resources. 

This  paper  is  concerned  with  the  design  requirements  of  Host  operating  systems:  those  systems  whose  primary 
business  is  the  management  of  computing  resources  rather  than  communication  resources.  Low-level 
communication  tasks  such  as  routing  fall  outside  the  realm  of  the  Host  responsibilities  discussed  here  and  are 
performed by means of a sub-network of small computers dedicated to the task of packet-switching. In the
ARPANET these computers are called Interface Message Processors, or IMPs, and use packet-switching
techniques to communicate via 50-kilobit common carrier circuits. Each IMP provides up to four high-speed
synchronous serial ports to which Hosts connect using special-purpose Host-IMP interfaces. Packet-switching
network  environments  place  special  requirements  on  the  design  of  the  connected  Host  operating  systems. 
Attachment  to  the  ARPANET,  for  example,  has  required  a  number  of  additions  or  modifications  to  existing 
operating  systems.  There  are  certain  structural  features  which  must  be  incorporated  in  system  design  in  order  to 
facilitate  effective  use  of  distributed  computing  resources.  We  begin  by  examining  a  few  of  these  features. 

[Rowe  75]  Rowe,  Lawrence  A. 

The  Distributed  Computing  Operating  System. 

Technical Report 66, Department of Information and Computer Science, University
of  California,  Irvine,  June,  1975. 

Abstract 

The  Distributed  Computing  System  (DCS)  is  a  computer  network  architecture  emphasizing  reliable,  fail-soft  service 
of  an  operating  system  for  a  DCS.  Issues  discussed  include  interprocess  communication,  system  initiation,  and 
failure  detection  and  recovery.  Features  of  the  implementation  of  a  prototype  system  and  some  experiences 
gained  from  building  and  using  the  prototype  are  also  described. 

Conclusions  made  from  this  work  are  that  problems  and  solutions  discovered  while  developing  minicomputer 
networks  are  the  same  as  those  encountered  in  developing  networks  of  larger  machines.  Specifically,  DCS  and  its 
operating system demonstrate that systems without centralized control can be constructed, that broadcast
messages  are  useful,  and  that  messages  which  are  sent  to  a  process  but  are  intercepted  and  acted  upon  by  the 
environment  of  the  receiving  process  are  necessary  to  achieve  location  independence. 
[Ruschitzka 73] Ruschitzka, M. G. and R. S. Fabry.

The  Prime  Message  System. 

In Digest of Papers, COMPCON 73, pages 125-128. IEEE, February, 1973.

Abstract 

The  message  system  of  the  PRIME  system  which  is  currently  being  constructed  at  the  University  of  California  at 
Berkeley  combines  the  addressing  generality  of  a  network  message  system  with  a  cost  conscious  implementation 
typical  of  single  processor  systems.  This  paper  deals  with  its  design  and  implementation  details. 

[Ryan 79a] Ryan, M. D.

Design  of  a  Distributed  System:  Overview  of  System. 

The Australian Computer Journal 11(3):95-102, August, 1979.

Abstract 

When consideration is given to distributed processing many different ideas are encountered. This is due to the
fact that there are no clear concepts involved and much of the effort has been hardware rather than software
driven. However, there is one clear thing about distributed processing and that is it is about communication, and
for any application to be implemented there must be a solid basis of communication on which to build. This project
is to do just that.

The basic design was started after a reasonable survey of the literature was carried out, however this led to the
biggest problem of all, which was the separation of the concepts involved into the relevant areas. This led to a
great  deal  of  wasted  effort.  Despite  this  approach  certain  biases  have  had  a  great  influence,  hopefully  for  the 
better,  on  the  final  design. 

[Ryan  79b]  Ryan,  M.  D. 

Design  of  a  Distributed  System:  Interprocess  Communication. 

The Australian Computer Journal 11(3):103-107, August, 1979.

Abstract 

In  this  paper  interprocess  communication  is  discussed  within  the  constraints  outlined  in  the  companion  paper 
(Ryan,  1979).  The  companion  paper  discusses  the  overall  concepts  of  a  distributed  system  and  emphasizes  the 
role played by interprocess communication in such systems. However, it asserts that interprocess communication
should be designed within the host operating system and then extended to a network environment.

[Saettone  78]  Saettone,  R. 

MITS: Microprocessor Implementation of a Transport Station.

In Proceedings, Computer Network Protocols Symposium, pages E3-1 - E3-5.
Université de Liège, February, 1978.

Abstract 

This  paper  describes  the  implementation  of  the  communication  interface  and  transport  protocol  in  the  link 
between the CYCLADES network and a host computer, such as the IBM 360/67 at the University of Grenoble.

System throughput is increased by a multi-microprocessor architecture that executes not only the communication
functions  associated  to  a  serial  data  link,  but  also  most  of  the  functions  of  the  transport  protocol  used  at  the  front 
of  the  network.  It  interfaces  on  one  side  with  a  serial,  synchronous  full  duplex  line  via  a  modem.  On  the  other  side, 
it  is  attached  to  the  channel  of  the  host's  I/O  processor,  as  a  peripheral  device. 

The  main  goal  of  this  approach  is  to  relieve  the  host  from  all  the  communication  functions  and  to  execute  them  in  a 
functionally  equivalent  peripheral  device.  A  considerable  reduction  in  the  cost/performance  ratio  is  obtained  by 
the  use  of  general  purpose  microprocessors  instead  of  a  front-end  mini-computer  or  a  special-purpose  processor. 

[Sakai 77] Sakai, Toshiyuki, Tsunetoshi Hayashi, Shigeyoshi Kitazawa, Koichi Tabata and
Takeo Kanade.

Inhouse Computer Network KUIPNET.

In  Proceedings,  Information  Processing  77,  pages  161-166.  IFIP,  1977. 

Abstract 

The inhouse resource sharing computer network KUIPNET (Kyoto University Information Processing NETwork) is
described. It is intended to support advanced researches in information processing by sharing resources such as
files and devices among participating host computers in one building. The network can handle raw data such as
digitized image and speech signals as well as character-oriented message data. Design consideration of the
network and the operating systems of host computers are described. Some examples of applications using the
network are presented as well as results of traffic measurement.

[Schantz  75]  Schantz,  Richard  E. 

A  Commentary  on  Procedure  Calling  as  a  Network  Protocol. 

Technical  Report  RFC  #684,  ARPA  Network  Working  Group,  April,  1975. 

Abstract 

While  the  Procedure  Call  Protocol  (PCP)  and  its  use  within  the  National  Software  Works  (NSW)  context  attacks 
many of the problems associated with integrating independent computing systems to handle a distributed
computation, it is our feeling that its design contains flaws which should prevent its widespread use, and in our
view, limit its overall utility. We are not voicing our objection to the use of PCP, in its current definition, as the base
level implementation vehicle for the NSW project. It is already too late for any such objection, and PCP may, in
fact, be very effective for the NSW implementation, since they are proceeding in parallel and probably influenced
each other. Rather, we are voicing an objection to the "PCP philosophy", in the hope of preventing this type of
protocol  from  becoming  the  de-facto  network  standard  for  distributed  computation,  and  in  the  hope  of  influencing 
the  future  direction  of  this  and  similar  efforts. 

[Schlichting  82]  Schlichting,  Richard  D.  and  Fred  B.  Schneider. 

Using Message Passing for Distributed Programming: Proof Rules and Disciplines.
Technical  Report  TR  82-491,  Department  of  Computer  Science,  Cornell  University, 
May,  1982. 

Abstract 

Inference  rules  for  proving  the  partial  correctness  of  concurrent  programs  that  use  message-passing  for 
synchronization  and  communication  are  derived.  Three  types  of  message-passing  primitives  are  considered: 
synchronous,  asynchronous  and  remote  procedure  call  (rendezvous).  The  proof  rules  show  how  interference  can 
arise  and  be  controlled.  They  also  provide  insight  into  why  distributed  programs  are  hard  to  design  and 
understand. 

[Schmid  74]  Schmid,  Hans  Albrecht. 

An Approach to the Communication and Synchronization of Processes.

In Proceedings, 1973 International Computing Symposium, pages 165-171. IFIP,
April,  1974. 

Abstract 

For  the  communication  of  concurrent  processes  we  introduce  primitives  which  allow  uniform  modelling  of 
competition for devices, as well as of cooperation which takes place by exchange of synchronization signals.
Using these primitives, process systems are split up into processes independent of, and processes communicating
with the environment. This allows easy transformation of process systems into Petri nets. Petri nets, as an abstract
mathematical  tool,  seem  to  be  appropriate  to  the  treatment  of  all  problems  caused  by  interaction  of  concurrent 
processes,  as  for  example  deadlocks  and  their  prevention. 

[Sherman  82]  Sherman,  Richard  H.,  Melvin  G.  Gable  and  Anthony  Chung. 

Overcoming  Local  and  Long-Haul  Incompatibility. 

In  DATA  COMMUNICATIONS,  pages  195-206.  March,  1982. 

Abstract 

There are long-distance data networks - composed of switched or leased facilities, point-to-point, or multipoint
connections - and now there are local networks. Most agree that the two will have to interconnect and some
progress  is  being  made  in  this  area.  But  it  remains  unclear  how  this  can  be  done  while  retaining  end-to-end 
network  efficiency,  reliability,  connectivity,  and  cost-effectiveness. 

A  network  protocol  layer  is  needed  that  can  adapt  to  the  evolution,  operation,  and  interconnection  of  such  diverse 
networks.  The  network  should  accommodate  computers  that  implement  different  network  protocols,  and  the 
network components, such as interfaces and computers, should be as easy to install as modems - without requiring
communications  or  computer  specialists.  Some  modems,  for  example,  can  now  sense  the  data  rate  and 
modulation  scheme  and  automatically  adapt.  Whole  networks  should  be  able  to  merge  with  or  separate  from  other 
networks  as  easily  as  individual  network  components  are  added  and  removed. 

In  view  of  this,  an  experimental  network  has  been  developed  at  Ford  in  an  attempt  to  implement  these  evolutionary 
and operational objectives. Different types of networks were interconnected using a uniform network protocol
layer  developed  to  perform  measurement  and  cc-Uv.xurxUiorat, 
[Shoch 79] Shoch, John F.

An  Overview  of  the  Programming  Language  Smalltalk-72. 

ACM-SIGPLAN  Notices  14(9):64-73,  September,  1979. 

Abstract 

Smalltalk is a programming language designed around a single metaphor - that similar objects can be grouped
into  more  general  classes.  Starting  with  a  conceptually  elegant  and  consistent  epistemology,  it  has  been  possible 
to  construct  a  language  with  powerful  semantic  capabilities,  while  retaining  a  simple  syntactic  representation. 

The  language  development  itself  is  but  one  part  of  a  broader  effort  to  explore  the  ways  in  which  people  can 
manipulate information and communicate with machines. It is one tool utilized in the construction of an interactive
computer system, used by both children and adults for problem solving, simulation, drawing and painting, real time
generation  of  music,  information  retrieval,  and  other  tasks. 

[Silberschatz 81] Silberschatz, Abraham.

A  Note  on  the  Distributed  Program  Component  Cell. 

ACM-SIGPLAN  Notices  16(7).  July,  1981. 

Abstract 

This  paper  presents  a  new  language  construct  for  distributed  computing.  This  construct,  called  a  cell,  allows  one 
to  simulate  a  variety  of  language  constructs.  Its  salient  features  provide  the  programmer  with  an  effective 
synchronization  scheme,  and  a  mechanism  to  control  the  order  in  which  various  activities  within  a  cell  should  be 
executed. 

[Sloman  80]  Sloman,  M.  S.  and  S.  Prince. 

Local  Network  Architecture  for  Process  Control. 

In  Proceedings,  IFIP  Working  Group  6.4  International  Workshop  on  Local  Networks 
for Computer Communications, pages 407-427. IBM, August, 1980.

Abstract 

The  physical  distribution  of  equipment  and  machinery  on  an  industrial  site  makes  it  particularly  suitable  for 
implementing  distributed  computer  control  systems.  There  is  also  a  need  for  a  serial  communication  system  even 
in  a  centralised  control  so  as  to  save  on  wiring  costs,  which  can  be  substantial. 

This  paper  identifies  the  communication  requirements  for  Distributed  Process  Control  Systems  and  indicates  the 
main  differences  between  Process  Control  and  other  application  areas. 

A network architecture for Process Control which caters for arbitrary point-to-point or broadcast data links is
presented. The architecture is based on the lower 4 layers of the ISO Open Systems Model. The services provided
and functions performed by each layer are described. Network management is also briefly discussed.

[Solomon  79]  Solomon,  Marvin  H.  and  Raphael  A.  Finkel. 

The  Roscoe  Distributed  Operating  System. 

In  Proceedings,  Seventh  Symposium  on  Operating  Systems  Principles,  pages 
108-114. ACM, December, 1979.

Abstract 

Roscoe is an operating system implemented at the University of Wisconsin that allows a network of
microcomputers to cooperate to provide a general-purpose computing facility. After presenting an overview of the
structure of Roscoe, this paper reports on experience with Roscoe and presents several problems currently being
investigated  by  the  Roscoe  project. 

[Spector 81a] Spector, Alfred Z.

Multiprocessing Architectures for Local Computer Networks.

Technical  Report  STAN-CS-81-874,  Department  of  Computer  Science,  Stanford 
University,  August,  1981. 

[cute  title]. 

Abstract 

This  dissertation  discusses  the  interconnection  of  computers  with  very  high  speed  local  networks  in  a  manner  that 
can  support  a  large  class  of  distributed  programs  -  a  class  that  includes  programs  requiring  highly  efficient 
interprocessor communication. This research is motivated by (1) prospects for local networks having a capacity
of 100 megabits/second or higher; (2) continuing advances in semiconductor technology; (3) the increasing
availability of inexpensive, low-latency, non-volatile storage; and (4) inadequacies in existing software technology
that prevent these technological advances from being fully exploited.

In the early sections of this work, the primary thesis is developed; it explicitly presents the properties that we
require of a local network-based multiprocessor. The analysis and validation of this thesis leads to four major
contributions. The first is a comparison of very high speed ring and broadcast networks when they are used with
short  packets.  As  part  of  this  comparison,  a  new  analytic  model  is  presented,  whose  solution  yields 
delay/throughput  data  for  token  rings. 

The  second  major  contribution  is  a  new  communication  model  for  local  computer  networks  whereby  processes 
execute  generalized  remote  references  that  cause  operations  to  be  performed  by  remote  processes.  This  remote 
reference/remote  operation  model  provides  a  taxonomy  of  primitives  that  are  naturally  useful  in  many  applications 
and can be specially implemented to provide for high efficiency. Example communication primitives and
techniques for their implementation are provided to show the utility of the model.

Following  these  discussions,  we  present  experience  with  the  implementation  of  one  class  of  remote  references. 
These references take about 150 microseconds or 50 average macroinstruction times to perform on Xerox Alto
computers connected by a 2.97 megabit Ethernet. This experiment demonstrates the power of special-casing
communication primitives and helps to validate the remote reference/remote operation model.

Finally,  various  implementation  techniques  are  presented  that  can  be  used  for  a  real  communication  system  based 
upon  the  model.  We  discuss  such  topics  as  the  efficient  transmission  of  large  blocks  through  the  use  of  multiple 
small  packets  and  the  efficient  implementation  of  stable  storage. 

[Spector  81  b]  Spector,  Alfred  Z. 

Extending Local Network Interfaces to Provide More Efficient Interprocessor
Communication  Facilities. 

In Proceedings, Eighth Symposium on Operating Systems Principles, pages 6-13.
ACM,  December,  1981. 

Abstract 

This  paper  describes  extensions  to  local  networking  interfaces  that  allow  high  speed  networks  of  processors  to  be 
used  in  certain  multiprocessor  applications.  If  network  interfaces  are  augmented  to  permit  more  efficient  message 
passing  as  well  as  direct  memory  access  to  other  processors'  memories,  distributed  applications  requiring 
frequent  interprocessor  communication  can  be  better  supported.  Resulting  systems  would  have  a  hybrid 
architecture  with  characteristics  of  both  shared  memory  and  message  passing  systems,  and  could  fully  use  the 
inherent reliability and extremely high bandwidth now provided by local networks.

The  motivation  and  initial  implementation  plans  for  an  experimental  system  to  be  conducted  on  Xerox  Alto 
computers  are  discussed.  Included  in  the  prototype  will  be  new  machine  instructions  and  Ethernet  protocols  that 
allow  for  both  virtual  shared  memory  and  datagram  operations.  The  resulting  system  will  demonstrate  that 
interprocess  synchronization  and  communication  can  be  performed  much  more  efficiently  by  special  purpose 
firmware  than  by  current  message  passing  mechanisms. 

[Spier  73a]  Spier,  Michael  J. 

The  Experimental  Implementation  of  a  Comprehensive  Inter-Module 
Communication  Facility. 

In Proceedings, 1973 Sagamore Computer Conference on Parallel
Processing, pages ?-?. Syracuse University, August, 1973.

Abstract 

In 1972, the Digital Equipment Corporation sponsored a limited-objective research project to investigate the
properties of the new kernel/domain systems architecture, whose theoretical model was earlier developed by
Spier. A companion paper reports on that project. The domain is a monitor (or supervisor, executive)-like local
independent  address  space  which  may  be  mapped  over  a  collection  of  (mostly)  exclusive  memory  space  partitions 
to  provide  a  protected  runtime  environment.  Similar  to  the  classical  monitor,  control  may  be  transferred  into  the 
domain  through  predesignated  inter-domain  entry  points  named  gates  In  a  single  monolithic  monitor,  but  is 
distributed  among  a  number  of  supervisory  domains;  of  these,  the  most  central  and  most  critical  supervisory 
domain  is  named  kernel.  The  kernel  is  responsible  for  basic  resource  management  only  and  is  by  definition  devoid 
of  any  decision  making  code. 

[Spier 73b] Spier, Michael J.

Process  Communication  Prerequisites  or  the  IPC-Setup  Revisited. 

In  Proceedings,  1973  Sagamore  Computer  Conference  on  Parallel 
Processing,  pages  79-88.  Syracuse  University,  August,  1973. 

Abstract 

A  careful  examination  of  any  existing  inter-process  communication  (IPC)  mechanism  invariably  uncovers  the 
underlying  existence  of  a  more  fundamental  IPC  mechanism,  which  in  turn  is  built  on  a  yet  more  fundamental  IPC 
mechanism ... etc.

This  study  resolves  this  indefinite  recursion  of  a  self  defining  mechanism  by  proposing  a  certain  causality, 
expressed in terms of a finite list of process communication prerequisites, and based on a nonmechanistic
postulate  which  calls  for  an  area  of  communication  (or  mailbox)  that  is  by  its  very  nature  impervious  to  mutual 
interference  by  the  communicating  processes. 

Given  arbitrary  processes  for  which  these  prerequisites  hold,  we  may  logically  construct  the  "very  first" 
elementary IPC mechanism, i.e., the one which is not dependent upon its own pre-existence. Such a mechanism
is developed in this paper; it is capable of transmitting a single, one-way, one-bit message among processes.

It is suggested that the proposed causality, although arbitrary in many ways (and openly admitted as such), may
serve  as  a  convenient  intellectual  tool  with  which  autonomous  sequential  processes  may  be  observed  and  studied. 

[Spratt 81] Spratt, E. Brian.

Operational  Experiences  with  a  Cambridge  Ring  Local  Area  Network  in  a  University 
Environment. 

In  Proceedings,  IFIP  Working  Group  6.4,  International  Workshop  on  Local 
Networks, pages 81-106. IBM, August, 1981.

Abstract 

The  University  of  Kent  Computing  Laboratory  is  responsible  both  for  the  central  University  Computing  Service  and 
for  teaching  and  research  in  Computer  Science.  At  the  time  of  writing  there  is  a  total  of  some  1300  users,  out  of  a 
total  University  population  of  4100. 

Interest  in  the  Laboratory  in  Local  Area  Networks  goes  back  to  the  early  seventies. 

[Stankovic  82]  Stankovic,  John  A. 

Software  Communication  Mechanisms:  Procedure  Calls  Versus  Messages. 
Computer 15(4):19-25, April, 1982.

Abstract 

Procedure  calls  and  messages  are  two  software  communication  techniques  in  wide  use  today.  Whereas  the 
semantics of the procedure call are well known, the newness and variety of message communication make it less
understood. 

Furthermore, the terms "procedure calls" and "messages" are often used in a general and imprecise manner, and
therefore the differences between them tend to blur. This happens, for example, when the claim is made that
messages can be programmed using procedure calls - a claim that is both true and, in fact, reflects what is often
done  in  practice. 

[Staunstrup  82]  Staunstrup,  Jorgen. 

Message  Passing  Communication  Versus  Procedure  Call  Communication. 

Software-Practice and Experience 12(?):223-234, 1982.

Abstract 

Communication by message passing or by procedure calls is one of the key issues when discussing languages for
multiprogramming.  The  two  languages  Platon  and  Concurrent  Pascal  represent  the  different  approaches  which 
are  contrasted  by  presenting  a  few  programs  written  in  both  languages. 

[Stritter 81] Stritter, Edward P., Harry J. Saal and Leonard J. Shustek.

Local  Networks  of  Personal  Computers. 

In Digest of Papers, COMPCON 81 Spring, pages 2-5. IEEE, 1981.

Abstract 

The technologies of local computer networks and of personal computers are beginning to interact. Local
networks enhance sharing and communication in a computer installation. Personal computers make significant
dedicated computer power available to the user at a cost that is little more than that of a terminal connected to a
more traditional large system.

A  commercially  available  local  computer  network  of  personal  computers  is  described  here.  The  system  combines 
the  advantages  of  personal  computers  (low  cost  per  user,  a  computer  on  every  desk,  etc.)  with  those  of  local 
computer  networks  (access  to  shared  resources,  cost  sharing  of  expensive  peripherals,  smooth  system  growth 
with constant compute power per user).

Many  new  capabilities  derive  from  local  computer  networks  such  as  sharing  of  data,  computer-to-computer 
communication,  and  intelligent  server  resources  (shared  high-speed  printers,  file  systems,  data-base  backends, 
etc.). This paper discusses the network and internetwork configurations which make such capabilities possible.

[Stroet  80]  Stroet,  Jan. 

An  Alternative  to  the  Communication  Primitives  in  ADA. 

ACM-SIGPLAN Notices 15(12):62-74, December, 1980.

Abstract 

A  critical  look  is  taken  at  the  ADA  communication  primitives  by  comparing  them  to  the  ITP  (Input  Tool  Process) 
model, the model for process communication developed at Nijmegen. The comparison is done by means of
example  solutions  to  several  problems  in  both  models.  It  is  shown  that  by  using  features  extracted  from  the  ITP 
model,  the  communication  facilities  in  ADA  could  be  improved  considerably  with  respect  to  orthogonality,  clarity, 
flexibility  and  power. 

[Stroustrup 79] Stroustrup, Bjarne.

An  Inter-Module  Communication  System  for  a  Distributed  Computer  System. 

In  Proceedings,  First  International  Conference  on  Distributed  Computing 
Systems, pages 412-418. IEEE, October, 1979.

Abstract 

This paper outlines the design of an inter-module communication system suitable for a computer system
consisting of many separate machines communicating via a local communication network. This inter-module
communication system was designed to support the SIMOS operating system utilizing a number of such machines
to provide services normally provided by a centralized system. Examples of how "server" machines can be used
to  run  the  SIMOS  file  system  are  presented  together  with  data  showing  the  effect  of  such  usage  on  the  overall 
system  performance. 

[Sunshine  76]  Sunshine,  Carl  A. 

Factors  in  Interprocess  Communication  Protocol  Efficiency  for  Computer 
Networks. 

In  Proceedings,  National  Computer  Conference,  pages  571-576.  AFIPS,  1976. 

Abstract 

This  paper  considers  the  efficiency  of  interprocess  communication  protocols  for  distributed  processing 
environments  such  as  computer  networks.  Previous  research  has  emphasized  system  performance  at  lower 
levels,  within  the  communication  medium  itself,  while  this  work  examines  requirements  and  performance  of 
protocols  for  communication  between  processes  in  the  Host  computers  attached  to  the  communication  system. 
Efficiency  primarily  concerns  throughput  and  delay  achievable  for  communication  between  remote  processes. 
Various  aspects  of  protocol  operation  are  analyzed,  and  protocol  policies  concerning  retransmission,  flow  control, 
buffering, acknowledgment, and packet size emerge as the most important factors in determining efficiency.
Several  graphs  showing  quantitative  performance  results  for  representative  situations  are  included. 

[Sunshine 78] Sunshine, Carl A. and Yogen K. Dalal.

Connection  Management  in  Transport  Protocols. 

Computer  Networks  2(3):454-473, 1978. 

Abstract 

Transport  protocols  are  designed  to  provide  fully  reliable  communication  between  processes  which  must 
communicate  over  a  less  reliable  medium  such  as  a  packet  switching  network  (which  may  damage,  lose,  or 
duplicate packets, or deliver them out of order). This is typically accomplished by assigning a sequence number
and  checksum  to  each  packet  transmitted,  and  retransmitting  any  packets  not  positively  acknowledged  by  the 
other  side.  The  use  of  such  mechanisms  requires  the  maintenance  of  state  information  describing  the  progress  of 
data  exchange.  The  initialization  and  maintenance  of  this  state  information  constitutes  a  connection  between  the 
two  processes,  provided  by  the  transport  protocol  programs  on  each  side  of  the  connection.  Since  a  connection 
requires significant resources, it is desirable to maintain a connection only while processes are communicating.
This requires mechanisms for opening a connection when needed, and for closing a connection after ensuring
that all user data have been properly exchanged. These connection management procedures form the main
subject of this paper. Mechanisms for establishing connections, terminating connections, recovering from
crashes or failures of either side, and for resynchronizing a connection are presented. Connection management
functions are intimately involved in protocol reliability, and if not designed properly may result in deadlocks or old
data  being  erroneously  delivered  in  place  of  current  data.  Some  protocol  modeling  techniques  useful  in  analyzing 
connection  management  are  discussed,  using  verification  of  connection  establishment  as  an  example.  The  paper 
is  based  on  experience  with  the  Transmission  Control  Protocol  (TCP),  and  examples  throughout  the  paper  are 
taken  from  TCP. 
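
The mechanism summarized in this abstract (a sequence number and checksum on each packet, with
retransmission of anything not positively acknowledged) can be illustrated with a minimal Python sketch.
It is not taken from the paper or from TCP: it is a stop-and-wait simplification, and the names make_packet,
Receiver, scripted_channel and send_reliably are hypothetical. The unreliable medium is simulated by a
scripted list of faults, and acknowledgments are assumed never to be lost.

```python
import zlib


def make_packet(seq, payload):
    """Attach a sequence number and a CRC-32 checksum to a payload."""
    return {"seq": seq, "payload": payload, "checksum": zlib.crc32(payload)}


class Receiver:
    """Holds the receiving side's connection state (next expected sequence number)."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, packet):
        """Return the acknowledged sequence number, or None if no acknowledgment is sent."""
        if zlib.crc32(packet["payload"]) != packet["checksum"]:
            return None                      # damaged: stay silent, sender retransmits
        if packet["seq"] == self.expected:
            self.delivered.append(packet["payload"])
            self.expected += 1
            return packet["seq"]
        if packet["seq"] < self.expected:
            return packet["seq"]             # duplicate: acknowledge again, do not redeliver
        return None                          # ahead of sequence: ignore


def scripted_channel(faults):
    """Simulated medium: applies one scripted fault ('lose', 'damage', 'ok') per send."""
    faults = iter(faults)

    def channel(packet):
        fault = next(faults, "ok")
        if fault == "lose":
            return None
        if fault == "damage":
            payload = packet["payload"]
            flipped = bytes([payload[0] ^ 1]) + payload[1:]   # flip one bit
            return dict(packet, payload=flipped)
        return packet

    return channel


def send_reliably(payloads, channel, receiver, max_tries=10):
    """Send each non-empty payload, retransmitting until it is positively acknowledged."""
    for seq, payload in enumerate(payloads):
        for _ in range(max_tries):
            arrived = channel(make_packet(seq, payload))
            if arrived is not None and receiver.receive(arrived) == seq:
                break
        else:
            raise RuntimeError(f"packet {seq} was never acknowledged")


receiver = Receiver()
send_reliably([b"hello", b"world"],
              scripted_channel(["lose", "damage", "ok"]), receiver)
```

The receiver's expected counter is the kind of per-connection state whose initialization and cleanup the
paper treats as connection management; the sketch assumes that state already exists on both sides.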

[Terada  80]  Terada,  M.,  J.  Kashio,  K.  Yokota,  Y.  Hori  and  H.  Fushimi. 

A  Network  Operating  System  for  High  Speed  Optical  Fiber  Loop  Transmission 
System. 

In  Proceedings,  Fifth  International  Conference  on  Computer  Communication: 
Increasing  Benefits  for  Society,  pages  641-646.  October,  1980. 

Abstract 

This  paper  describes  a  network  operating  system  (NOS)  developed  for  inhouse  computer  networks.  The  optical 
fiber  loop  system  with  a  high  transmission  speed  of  10  Mbits/sec  connects  some  of  the  minicomputers  and  the 
many  microcomputers.  The  NOS  is  designed  to  provide  the  following  three  features: 

1.  An  improved  interface  between  computer  and  transmission  control  equipment  to  achieve  efficient  data 
transfer. 

2. Unified access interfaces to user programs with respect to the access to two kinds of resources, one
being locally attached peripherals and the other remote devices.

3. Centralized network system maintenance and operation functions for distributed mini/micro-computers,
in order to obtain overall system efficiency and cost effectiveness.

[Test 79] Test, Jack A.

An  Interprocess  Communication  Scheme  for  the  Support  of  Cooperating  Process 
Networks. 

In  Proceedings,  First  International  Conference  on  Distributed  Computing 
Systems, pages 405-411. IEEE, October, 1979.

Abstract 

This  paper  describes  an  interprocess  communication  scheme  for  application  in  Distributed  Operating  System 
Environments.  This  scheme  is  based  upon  the  notions  of  gates  to  processes  and  connections  between  gates.  A 
gate,  as  conceived  here,  serves  as  a  standard  interface  between  the  internal  environment  of  a  process  and  its 
external environment. A connection between gates allows a simplex information transfer between those gates.
The  proposed  IPC  primitives,  when  implemented  as  part  of  a  distributed  operating  system  kernel,  provide  some 
measure  of  built-in  fault  detection;  enforce  a  capability-like  scheme  for  gate  access  protection;  and  support 
multi-process  dialogues. 

[Thomas  76]  Thomas,  Robert  H.  and  Stuart  C.  Schaffner. 

MSG: The Interprocess Communication Facility for the National Software Works.
Technical Report 3483, Bolt Beranek and Newman Inc., December, 1976.

Abstract 

The  National  Software  Works  (NSW)  provides  software  implementers  with  a  suitable  environment  for  the 
development  of  programs.  This  environment  consists  of  many  software  development  tools  (such  as  editors, 
compilers, and debuggers), running on a variety of computer systems, but accessible through a single access-granting,
resource-allocating monitor with a single, uniform file system. By its very nature, the NSW consists of
processes  distributed  over  a  number  of  computers  connected  by  a  communications  network.  These  processes 
must  communicate  with  one  another  in  order  to  create  a  unified  system.  This  paper  describes  the  communication 
facility (named MSG) which was developed to provide interprocess communication for the implementation of the
NSW. As we have noted, the communication network is currently the ARPANET. However, we have designed the
MSG facility to be as independent as possible of the ARPANET implementation so that the concepts may be carried
over to implementation on other networks.

[Tilborg 80] Tilborg, André M. van and Larry D. Wittie.

A  Concurrent  Pascal  Operating  System  For  a  Network  Computer. 

In  Proceedings,  COMPSAC  80,  pages  1-7.  IEEE,  October,  1980. 

Abstract 

A  network  computer  (multi-micro-processor,  modular  computer)  requires  both  a  high-level  control  structure  to 
bind the nodes into a cohesive computing device and a local operating system for each of the nodes to execute.
This  paper  outlines  the  form  of  a  hierarchical  high-level  control  schema  and  describes  a  nodal  operating  system 
designed  and  built  for  the  MICRONET  network  computer  using  Concurrent  Pascal.  The  operating  system  consists 
of  a  packet  switching  subsystem  which  executes  in  the  communications  frontend  processor  of  each  node  and  a 
host  processor  operating  system  which  manages  local  resources,  interfaces  to  user  terminals,  and  executes  task 
forces.  The  host  processor  operating  system  customizes  itself  to  the  abilities  of  each  node  to  some  extent  by  not 
initializing  some  of  its  built-in  Concurrent  Pascal  system  components  depending  on  the  resources  available  at  the 
host  node.  However,  every  node  executes  a  process  which  supports  the  high-level  control  schema. 

[Tilborg 82] Tilborg, André M. van.

Packet  Switching  in  the  MICRONET  Network  Computer. 

IEEE Transactions on Communications COM-30(6):1426-1433, June, 1982.

Abstract 

Packet  switching  is  a  communication  technology  which  has  been  used  extensively  in  geographically  distributed 
computer  networks.  It  is  also  applicable  to  the  communication  subnetworks  of  compact  multimicrocomputers 
known as network computers. This paper describes the use of the language Concurrent Pascal to build a
packet-switching  subsystem  for  the  MICRONET  network  of  DEC  LSI-11  microcomputers.  Examples  of  actual 
Concurrent  Pascal  source  code  taken  from  the  system  demonstrate  the  usefulness  of  high-level  languages  with 
abstract  data  types  for  complex  communication  software. 



[Vervoort  80]  Vervoort,  W.  A. 

A  Taxonomy  of  Interprocess  Communication. 

Technical  Report,  Twente  University  of  Technology,  1980. 

Abstract 

A classification system for interprocess communication in Distributed Systems without shared variables has been
developed based on three orthogonal choices with respect to the measure of freedom of the processes in their
communication. These three orthogonal axes construct a 3-D space of communication. Eleven example models of
communication found in literature have been described and located in this space. Some points in the space
remained unoccupied. Examples of the model closest to the origin (the most restricted) and the most free
communication  model  are  given  together  with  their  basic  concepts. 

[Walden  72]  Walden,  David  C. 

A  System  for  Interprocess  Communication  in  a  Resource  Sharing  Computer 
Network. 

Communications  of  the  ACM  15(4):221-230,  April,  1972. 

Abstract 

A  resource  sharing  computer  network  is  defined  to  be  a  set  of  autonomous,  independent  computer  systems, 
interconnected  to  permit  each  computer  system  to  utilize  all  the  resources  of  the  other  computer  systems  as  much 
as  it  would  normally  call  one  of  its  own  subroutines.  This  definition  of  a  network  and  the  desirability  of  such  a 
network are expounded upon by Roberts and Wessler in [9]. Examples of resource sharing could include a
program  filing  some  data  in  the  file  system  of  another  computer  system,  two  programs  in  remote  computer  systems 
exchanging  communications,  or  users  simply  utilizing  programs  of  another  computer  system  via  their  own. 


[Walden  75]  Walden,  David  C.  and  John  M.  McQuillan. 

Some  Consideration  for  a  High  Performance  Message-Based  Interprocess 
Communication  System. 

In  Proceedings,  SIGCOMM-SIGOPS  Interface  Workshop  on  Interprocess 
Communications,  pages  45-54.  ACM,  March,  1975. 

Abstract 

We continue to be concerned with interprocess communications systems (such as those described in references
1, 2, and 3 and called "thin-wire" communications systems in reference ?) which are suitable for communication
between processes that are not co-located in the same operating system but rather reside in different operating
systems on different computers connected by a computer communications network. Further, the systems with
which we are concerned are assumed to communicate using addressed messages (e.g., reference 5) which are
multiplexed onto the logical communications channel between the source process and the destination process,
rather than using such traditional methods as shared memory (an impossibility for distributed communicating
processes)  or  dedicated  physical  communications  channels  between  pairs  of  processes  desiring  to  communicate 
(which  is  considered  to  be  impossibly  expensive). 

[Walton 82] Walton, Robert L.

Rationale for a Queueable Object Distributed Interprocess Communication System.
IEEE  Transactions  on  Communications  COM-30(6):1417-1425,  June,  1982. 

Abstract 

We consider the problem of designing an interprocess communication system usable as a base for writing
real time operating and applications systems in a distributed environment where processes may be connected by
anything from shared virtual memory to radios. By requiring an interface that minimizes the code an application
program must devote to communications, a facility of substantially higher level than basic message passing
becomes necessary. This is largely a consequence of four major performance problems with interprocess
communication in a distributed environment: system reliability, server congestion, throughput, and response time.
We summarize these problems, and introduce an interprocess communication system based on two mechanisms:
queueable objects and connectable objects. We briefly review our experience with a limited implementation of
queueable objects.

[Watson 80] Watson, Richard W.

Distributed  System  Architecture  Model. 

[use the other one].

Abstract 

The  area  of  distributed  systems  is  new  and  not  well  defined.  The  purpose  of  this  chapter  is  to  provide  a  conceptual 
framework  for  organizing  the  discussion  of  distributed  system  design  goals,  issues,  and  interrelationships,  provide 
some  common  terminology  to  be  used  in  the  following  chapters,  and  provide  an  overview  of  some  common  design 
issues. We refer to this framework or reference architecture as the Distributed Systems Architecture Model or
simply as the Model. The remainder of the book elaborates the Model and presents alternative approaches to its
realization. Besides serving as an organizing framework for the material of this book, we believe, the Model is
useful in the design, organization, and analysis of a distributed system. The Model is shown in figure 2.1.

The  Model  contains  three  dimensions.  The  vertical  dimension  represents  a  distributed  system  as  consisting  of  a 
set  of  logical  layers.  This  book  is  primarily  organized  according  to  the  categories  on  this  axis.  Each  layer  and 
sublayer  has  design  and  implementation  issues  unique  to  itself  as  well  as  a  range  of  issues  common  among  all  the 
layers.  These  common  issues  are  shown  as  a  second  dimension  on  the  horizontal  axis.  The  problems  presented 
by  each  common  issue  and  the  appropriate  solutions  to  it  may  differ  in  each  layer.  The  third  dimension,  shown 
perpendicular to the page, concerns issues reflecting the global interaction of all parts of a distributed system on
whole-system implementation and optimization. This dimension is poorly understood. It is shown here primarily as
a reminder of its importance and the need for research to improve our understanding. Each of these dimensions is
discussed  in  detail  in  the  sections  to  follow. 

[Wecker  80]  Wecker,  Stuart. 

DNA:  The  Digital  Network  Architecture. 

IEEE Transactions on Communications COM-28(4):510-526, April, 1980.

Abstract 

Recognizing  the  need  to  share  resources  and  distribute  computing  among  systems,  computer  manufacturers  have 
been designing network components and communication subsystems as a part of their hardware/software system
offerings. A manufacturer's general purpose network structure must support a wide range of applications,
topologies, and hardware configurations. The Digital Network Architecture (DNA), the architectural model for the
DECnet family of network implementations, has been designed to meet the specific requirements and to create a
communications  environment  among  the  heterogeneous  computers  comprising  Digital's  systems. 

This paper describes the Digital Network Architecture, including an overview of its goals and structure, and details
on  the  interfaces  and  functions  within  that  structure.  The  protocols  implementing  the  functions  of  DNA  are 
described,  including  the  motivations  for  the  specific  designs,  alternatives  and  tradeoffs,  and  lessons  learned  from 
the  implementations.  The  protocol  descriptions  include  discussions  of  addressing,  error  control,  flow  control, 
synchronization,  flexibility  and  performance.  The  paper  concludes  with  examples  of  DECnet  operation. 


[Wettstein  ??]  Wettstein,  H.  and  G.  Merbeth. 

The  Concept  of  Asynchronization. 

[unknown]. 

Abstract 

Communication  between  parallel  processes  may  take  place  in  synchronous  or  asynchronous  form.  The  former 
has  widely  been  used  in  various  concepts.  In  contrast,  means  for  asynchronous  process  relations  exist  only  in  a 
few  systems  in  rudimentary  form.  In  this  paper  the  concept  of  asynchronization  is  developed  systematically.  The 
underlying  data  structures  as  well  as  operations  upon  them  are  defined  for  various  versions. 

[Wittie  79]  Wittie,  Larry  D. 

A  Distributed  Operating  System  For  a  Reconfigurable  Network  Computer. 

In Proceedings, First International Conference on Distributed Computing
Systems, pages 669-677. IEEE, October, 1979.

Abstract 

MICROS  is  the  distributed  operating  system  for  the  MICRONET  network  computer.  MICRONET  is  a  reconfigurable 
and extensible network of sixteen loosely-coupled LSI-11 microcomputer nodes connected by packet-switching
interfaces to pairs of high-speed shared communication buses. MICROS simultaneously supports many users,
each  running  multicomputer  parallel  programs.  MICROS  is  intended  for  control  of  network  computers  of  up  to  ten 
thousand  nodes. 

Each  network  node  is  controlled  by  a  private  copy  of  the  MICROS  kernel  processes  written  in  Concurrent  Pascal. 
Resource  management  tasks  are  distributed  over  the  network  in  a  control  hierarchy.  Management  and  user 
program  tasks  are  Sequential  Pascal  programs  dynamically  loaded  into  the  nodes.  Whether  in  the  same  or 
different  nodes,  tasks  communicate  via  a  uniform  message  passing  system.  The  MICROS  command  language 
allows  spawning  of  groups  of  communicating  tasks.  Concurrent  Pascal  will  eventually  be  provided  for  users 
writing  parallel  programs  for  MICRONET. 

[Wulf 81] Wulf, William A., Roy Levin and Samuel P. Harbison.

Hydra/C.mmp: An Experimental Computer System.

McGraw-Hill, New York, 1981.

Abstract 

An  operating  system  that  encourages  the  use  of  cooperating  sequential  processes  has  a  dual  responsibility.  On 
the  one  hand,  it  must  provide  protection  mechanisms  to  insulate  processes  from  one  another  so  that  erroneous  or 
malicious  behavior  on  the  part  of  one  cannot  interfere  with  unrelated  ones.  On  the  other  hand,  it  must  also  provide 
mechanisms  for  cooperation  among  the  processes  working  on  a  common  task.  The  last  two  chapters  have  dealt 
with  some  aspects  of  Hydra's  response  to  the  first  of  these  responsibilities.  In  this  chapter  we  shall  deal  with  one 
aspect  of  the  second. 

Within  the  Hydra  context,  a  wide  range  of  interaction  mechanisms  are  possible,  from  tightly  coupled  memory 
sharing  to  loosely  coupled  message  communication.  Moreover,  the  user  is  free  to  define  application-specific 
mechanisms  that  lie  anywhere  along  this  spectrum.  The  Hydra  Message  System  is  a  particular  communication 
facility  which  we  believe  is  convenient  for  many  loosely  coupled  applications,  and  which  can  form  the  basis  for 
many  others. 

[Xerox 81] Xerox Corporation.

Courier:  The  Remote  Procedure  Call  Protocol. 

Technical Report XSIS 038112, Xerox Corporation, December, 1981.

Abstract 

One  of  the  communication  disciplines  most  frequently  used  by  distributed  system  builders  is  that  in  which  a 
request  for  service  and  its  reply  are  exchanged  by  two  system  elements:  a  service  provider  and  a  service 
consumer.  Courier,  the  Network  System  (NS)  Remote  Procedure  Call  Protocol,  facilitates  the  construction  of 
distributed  systems  by  defining  a  single  request/reply  or  transaction  discipline  for  an  open-ended  set  of 
higher-level  application  protocols.  Courier  standardizes  the  format  of  request  and  reply  messages  and  the 
network representations for a family of data types from which request and reply parameters can be constructed.

Not  all  network  communication  is  transaction-oriented.  For  example,  the  exchange  of  control  information  that 
typically  precedes  the  transfer  of  a  file  between  system  elements  might  model  naturally  as  a  transaction.  However, 
the  transfer  of  the  file's  contents  is  more  appropriately  modeled  as  bulk  data  transfer. 

Not  all  transaction-oriented  communication  is  best  accomplished  using  Courier.  For  example,  the  interrogation  of 
a  directory  of  network  resources  to  locate  a  named  resource  might  model  naturally  as  a  transaction.  However, 
satisfying the performance requirements for that operation might necessitate the use of datagrams, rather than
virtual  circuits  (upon  which  Courier  is  based). 

Other NS protocols, for example the Sequenced Packet Protocol and the Internet Datagram Protocol [S], support
applications  for  which  Courier  is  inappropriate. 

[Zelkowitz  75]  Zelkowitz,  Marvin  V. 

A  Proposal  in  Process  Hierarchy  and  Network  Communication. 

In  Proceedings,  SIGCOMM-SIGOPS  Interface  Workshop  on  Interprocess 
Communications,  pages  154-158.  ACM,  March,  1975. 

Abstract 

Network  design  was  once  a  complex  art  that  few  understood  but  it  is  slowly  becoming  a  science  where  many  of  the 
fundamental  ideas  are  crystallizing  into  a  set  of  basic  axioms.  The  purpose  of  this  note  is  to  present  one  set  of 
ideas  and  show  how  they  can  be  developed  into  a  reliable  system.  The  system  will  be  hierarchically  structured  and 
has  a  powerful  protection  mechanism  that  allows  for  reliable  system  operation. 

[Zimmermann  81]  Zimmermann,  Hubert,  Jean-Serge  Banino,  Alain  Caristan,  Marc  Guillemont  and 
Gerard  Morisset. 

Basic  Concepts  for  the  Support  of  Distributed  Systems:  The  CHORUS  Approach. 

In  Proceedings,  Second  International  Conference  on  Distributed  Computing 
Systems,  pages  60-66.  IEEE,  April,  1981. 

Abstract 

Distribution brings completely new requirements for processing, synchronization, communication, protection, and
engineering  of  distributed  applications.  The  CHORUS  architecture  proposes  a  new  approach  to  meet  these 
requirements;  the  paper  introduces  a  set  of  basic  concepts  and  mechanisms  essentially  focused  on  run-time 
aspects  of  distributed  systems. 


A.3. Hardware Support for Operating Systems Architectures

[ACM  82]  ACM. 

Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems. 

ACM,  Palo  Alto,  California,  1982. 

This  Proceedings  contains  many  papers  on  actual  and  proposed  computer  systems 
which  incorporate  some  form  of  hardware  support  for  operating  systems. 

[Ahuja.S  82]  Ahuja,  S.R.  and  Asthana,  A. 

A  Multi-Microprocessor  Architecture  with  Hardware  Support  for  Communication 
and  Scheduling. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  205-209.  ACM,  March,  1982. 

A functionally partitioned multiprocessor system is described. A separate
processor is provided to handle communication and scheduling functions, and
there are special processors for handling I/O. The signalling and scheduling
processor balances the load among the multiple execution units.

[Ames.S  83]  Ames,  S.R.,  Jr.,  Gasser,  M.,  and  Schell,  R.R. 

Security  Kernel  Design  and  Implementation:  An  Introduction. 

IEEE  Computer  16(7):14-22,  July,  1983. 

The  security  kernel  is  currently  the  most  popular  approach  to  designing  very 
secure  computer  systems.  It  is  defined  as  the  hardware  and  software  which 
provides a "reference monitor" abstraction such that every reference to
information  or  change  of  authorization  must  pass  through  the  monitor.  The 
authors  provide  a  good  overview  of  the  security  kernel  approach,  and  indicate 
that  the  hardware  support  required  for  efficient  implementation  of  a  security 
kernel  is  quite  modest,  and  actually  found  on  many  commercially  available 
systems. 

[Anderson.G 75] Anderson, G.A. and Jensen, E.D.

Computer  Interconnection  Structures:  Taxonomy,  Characteristics,  and  Examples. 

Computing Surveys 7(4):197-213, December, 1975.

Anderson  and  Jensen  describe  a  taxonomy,  or  naming  scheme,  for  systems  of 
interconnected  computers.  This  taxonomy  is  very  useful  for  characterizing 
different  system  designs,  and  it  provides  a  common  context  in  which  to 
compare  them. 

[Anderson.J  72]  Anderson,  J.A.  and  Lipovski,  G.J. 

A  Cellular  Processor  for  Task  Assignments  in  Polymorphic,  Multiprocessor 
Computers. 

In  Proc.  of  Fall  Joint  Computer  Conf.,  pages  703-708.  AFIPS,  December,  1972. 

Anderson  and  Lipovski  describe  how  an  associative  memory  which  does  threshold 
matching  can  be  used  to  quickly  determine  which  job  requests  can  be  satisfied 
with  current  system  resources. 
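
The threshold match described in this annotation can be illustrated in software. The following is a minimal sketch, assuming that each job request and the free-resource state are vectors of per-resource counts. The resource names and the function `satisfiable_requests` are hypothetical and are not taken from the paper, and a sequential loop stands in for the parallel comparison that an associative memory would perform.

```python
from typing import Dict, List

Resources = Dict[str, int]


def satisfiable_requests(requests: List[Resources], available: Resources) -> List[int]:
    """Return indices of requests whose every demand is within what is available."""
    matches = []
    for index, request in enumerate(requests):
        # Threshold match: each requested amount must not exceed the free amount.
        if all(amount <= available.get(name, 0) for name, amount in request.items()):
            matches.append(index)
    return matches


if __name__ == "__main__":
    # Hypothetical resource names, chosen only for illustration.
    free = {"processors": 2, "memory_modules": 4}
    jobs = [
        {"processors": 1, "memory_modules": 2},
        {"processors": 3, "memory_modules": 1},
        {"processors": 2, "memory_modules": 4},
    ]
    print(satisfiable_requests(jobs, free))
```

In this sketch a request that names a resource absent from the free-resource state is treated as unsatisfiable, which is one possible convention among several.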


[Anderson.L 75] Anderson, L.D.

Disciplined  Software  Development  Utilizing  a  Hardware-Structured  Executive. 

In Proc. of EASCON. IEEE, September, 1975.

Anderson  describes  a  system  in  which  process  descriptor  segments  are  created, 
manipulated,  and  destroyed  by  hardware  primitives,  and  process  switching  is 
carried  out  automatically  as  part  of  the  basic  machine  cycle.  Virtual  address 
mapping  is  also  aided  by  an  associative  memory  containing  the  process  ID 
along  with  the  segment  number.  Using  the  process  ID  avoids  having  to  switch 
memory  mapping  registers  on  process  switches. 

[Applewhi.H 79] Applewhite, H.L., Arnold, R.G., Gorman, T.J., Gouda, M.G., and Marks, C.P.

Modular  Missile  Borne  Computer  (MMBC)  Software  Structure  and  Implementation. 

In Proc. of 1st Int. Conf. on Distributed Computing Systems, pages 725-735. IEEE,
October,  1979. 

MMBC  is  designed  to  support  pipelined  process  structures,  where  each  stage  of 
the  pipeline  can  have  replicated  processes  implementing  it.  Such  structures  are 
felt  to  be  very  useful  (and  common)  in  high  performance  real  time  systems. 

[Applewhi.H  80]  Applewhite,  H.L.,  Garg,  R.,  Jensen,  E.D.,  Northcutt,  J.D.,  Sha,  L.,  and  Wendorf,  J.W. 

Distributed  Computer  Systems:  Fiscal  Year  Interim  Report  to  Rome  Air 
Development  Center,  October  1980. 

Carnegie-Mellon University, Computer Science Department, 1980.

This  report  contains  the  initial  paper  on  ArchOS,  discussing  some  of  the  issues  in 
the  design  of  an  operating  system  for  a  distributed  computer  system. 

[Arden.B 81] Arden, B.W. and Ginosar, R.

MP/C:  A  Multiprocessor  /  Computer  Architecture. 

In  Proc.  of  8th  Annual  Symp.  on  Computer  Architecture,  pages  3-19.  IEEE  and 
ACM,  May,  1981. 

MP/C  is  a  dynamically  partitionable  multiprocessor  system.  A  fast  FORK  operation 
is  supported  by  partitioning  the  shared  bus  such  that  one  processor  (and 
associated  process)  is  active  in  each  partition.  Adjacent  partitions  can  later  be 
JOINed, leaving one processor active in the combined partition.

[Atkinson.T  75]  Atkinson,  T.D.,  Gagliardi,  U.O.,  Raviola,  G.,  and  Schwenk,  H.S.,  Jr. 

Modern  Central  Processor  Architecture. 

Proc.  of  the  IEEE  63(6):863-870,  June,  1975. 

The  operating  system  mechanisms  supported  in  hardware  by  the  Honeywell  Series 
60  Level  64  are  described.  The  Level  64  provides  automatic  queueing  and 
priority  dispatching  of  processes,  semaphores  with  and  without  messages,  and 
a  segmented  virtual  memory  system  with  hardware  protection  rings.  All  I/O  is 
handled  through  semaphores. 

[Bal.S  82]  Bal,  S.,  Kaminker,  A.,  Lavi,  Y.,  Menachem,  A.,  and  Soha,  Z. 

The NS16000 Family - Advances in Architecture and Hardware.

IEEE  Computer  15(6):58-67,  June,  1982. 

The NS16032 MPU is a 32-bit microprocessor with low level support for
semaphores,  and  state  saving  on  process  switches.  The  NS16082  MMU 
provides  virtual  address  translation  and  memory  protection. 
[Balzer.R 73] Balzer, R.M.

An  Overview  of  the  ISPL  Computer  System  Design. 

Communications of the ACM 16(2):117-122, February, 1973.

Balzer  stresses  the  advantages  of  concurrent  design  of  the  programming  language, 
operating  system,  and  machine  architecture  of  a  computing  system.  The  ISPL 
computer  system  provides  support  in  microcode  for  process  scheduling, 
memory  allocation,  and  a  port  mechanism  for  uniform  communication  with  files, 
devices,  and  processes. 

[Barton.G  82]  Barton,  G.C. 

Sentry:  A  Novel  Hardware  Implementation  of  Classic  Operating  System 
Mechanisms. 

In  Proc.  of  9th  Annual  Symp.  on  Computer  Architecture,  pages  140-147.  IEEE  and 
ACM,  April,  1982. 

The  Sentry  is  a  hardware  memory  protection  mechanism.  It  monitors  activity  on  the 
system  bus  and  blocks  those  references  which  are  not  permitted  in  the  active 
process. 

[Bell.C 82] Bell, C.G., Newell, A., Reich, M., and Siewiorek, D.P.

The  IBM  System/360,  System/370, 3030,  and  4300:  A  Series  of  Planned  Machines 
that  Span  a  Wide  Performance  Range. 

In Siewiorek, D.P., Bell, C.G., and Newell, A., editors, Computer Structures:
Principles  and  Examples,  pages  856-892.  McGraw-Hill,  1982. 

This  article  provides  a  good  survey  of  the  range  of  IBM  System/360,  System/370, 
and  follow-on  machines.  The  distinguishing  characteristics  and  extra  options 
for  the  various  models  are  all  briefly  examined.  A  number  of  models  are  found 
to  include  various  types  and  levels  of  support  for  operating  system  functions. 

[Berenbau.A  82]  Berenbaum,  A.D.,  Condry,  M.W.,  and  Lu,  P.M. 

The  Operating  System  and  Language  Support  Features  of  the  BELLMAC-32 
Microprocessor. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  30-38.  ACM,  March,  1982. 

The  BELLMAC-32  is  a  32-bit  microprocessor  which  provides  a  number  of 
mechanisms  for  making  operating  system  implementation  easier.  It 
"understands"  process  control  blocks,  providing  instructions  for  switching 
processes.  I/O  interrupts  cause  automatic  process  switches. 

[Berg.R 71] Berg, R.O. and Thurber, K.J.

A  Hardware  Executive  Control  for  the  Advanced  Avionic  Digital  Computer  System. 

In  Proc.  of  National  Aerospace  Electronics  Conf.,  pages  206-213.  IEEE,  May,  1971. 

The  AADC  is  a  real  time,  multiprocessor  system  containing  a  special  hardware  unit 
called  Master  Executive  Control  (MEC).  The  MEC  provides  all  executive  control 
for  the  system,  handles  interrupts,  and  does  all  scheduling  on  the  basis  of 
priority  and  importance  criteria.  The  MEC  uses  an  associative  memory  to  aid  in 
resource  management  by  making  searches  for  status  information  very  fast. 
[Berndt.H 76] Berndt, H.

Evolutionary  Computer  Architecture:  The  Unidata  7.000  Series. 

Computer Architecture News 5(1):10-16, April, 1976.

In  the  Unidata  7.000  Series,  process  switching  is  aided  by  Store  Status  of  Program 
(SSP)  and  Load  Status  of  Program  (LSP)  instructions  which  save  and  restore 
the  process  state  registers. 

[Bernhard.R  81]  Bernhard,  R. 

More  Hardware  Means  Less  Software. 

IEEE  Spectrum  18(12):30-37,  December,  1981. 

Bernhard  provides  a  good,  balanced  introduction  to  the  Reduced  Instruction  Set 
Computer  (RISC)  versus  Complex  Instruction  Set  Computer  (CISC)  controversy. 

[Blaauw.G  64]  Blaauw,  G.A.  and  Brooks,  F.P.,  Jr. 

The  Structure  of  System/360,  Part  I:  Outline  of  the  Logical  Structure. 

IBM Systems Journal 3(2):119-135, 1964.

Reprinted  in  Siewiorek  et  al.,  Computer  Structures:  Principles  and  Examples, 
McGraw-Hill,  1982,  pp.  695-706. 

[Boebert.W  77]  Boebert,  W.E.,  Bonneau,  C.H.,  and  Camall,  J.J. 

Secure  Computing. 

In  Proc.  of  Symp.  on  Trends  and  Applications  1977:  Computer  Security  and 
Integrity,  pages  49-63.  IEEE  and  NBS,  May,  1977. 

The "Secure Communications Processor" (SCOMP) is discussed, both in terms of
the  underlying  design  issues  and  the  actual  implementation.  SCOMP  supports 
multilevel  security  through  special  purpose  software  running  on  a  modified  and 
enhanced  Honeywell  Level  6  minicomputer.  A  special  hardware  Security 
Protection  Module  (SPM)  mediates  all  processor  to  memory,  processor  to 
device,  and  device  to  memory  interactions. 

[Boebert.W 78a] Boebert, W.E., Franta, W.R., Jensen, E.D., and Kain, R.Y.

Decentralized  Executive  Control  in  Distributed  Computer  Systems. 

In  Proc.  of  COMPSAC  '78,  pages  254-258.  IEEE,  November,  1978. 

This paper discusses the issues and requirements involved in the design of the
decentralized  executive  for  a  distributed  computer  system  (HXDP). 

[Boebert.W  78b]  Boebert,  W.E.,  Franta,  W.R.,  Jensen,  E.D.,  and  Kain,  R.Y. 

Kernel  Primitives  of  the  HXDP  Executive. 

In  Proc.  of  COMPSAC  '78,  pages  595-600.  IEEE,  November,  1978. 

The  HXDP  Executive  is  primarily  just  a  communication  kernel.  The  structure  of 
processes  (virtual  processors)  and  the  communication  mechanisms  (ports)  are 
described. 

[Boehm.B  83]  Boehm,  B.W. 

The  Hardware  /  Software  Cost  Ratio:  Is  It  a  Myth? 

IEEE  Computer  16(3):78-80,  March,  1983. 

Boehm  responds  to  Cragon’s  claim  that  the  hardware  /  software  cost  ratio  is  a 
myth  by  pointing  out  that  one  must  be  careful  about  the  situations  in  which  it  is 
applied. For the entire United States, the law seems to hold. However, there are
a  number  of  more  narrow  contexts  in  which  the  law  should  not  be  applied. 
Brandwajn, A., Hernandez, J.A., Joly, R., and Kruchten, Ph.

Overview  of  the  ARCADE  System. 

In Proc. of 6th Annual Symp. on Computer Architecture, pages 42-49. IEEE and
ACM, April, 1979.

ARCADE  is  a  multiprocessor  system  with  each  terminal  attached  to  its  own  "slow" 
processor,  which  handles  most  operating  system  tasks.  Application  processes 
are  assigned  dedicated  "fast"  processors  and  memory  modules,  selected  from 
a pool of such components. Hardware implemented resource allocation lists
are  provided  to  the  slow  processors  to  aid  in  allocating  the  fast  processors  and 
memory  modules  among  the  application  processes. 

Broadbent, J.K. and Coulouris, G.F.

MEMBERS  -  A  Microprogrammed  Experimental  Machine  With  a  Basic  Executive  for 
Real-Time  Systems. 

SIGPLAN Notices 9(8):154-160, August, 1974.

Proc. of ACM SIGPLAN-SIGMICRO Interface Meeting.

Brown,  G.E.,  Eckhouse,  R.H.,  Jr.,  and  Estabrook,  J. 

Operating  System  Enhancement  Through  Firmware. 

In Proc. of Micro10: 10th Annual Workshop on Microprogramming, pages 119-133.
IEEE  and  ACM,  October,  1977. 

The  paper  looks  at  the  improvements  possible  through  implementing  parts  of  the 
operating system nucleus in microcode. Queue manipulation and semaphores
are particular mechanisms that are investigated. A model of a simple
timesharing  system  shows  that  a  70  percent  reduction  in  nucleus  execution 
time  will  result  in  about  a  25  percent  reduction  in  response  time. 

Budzinski, R.L., Linn, J., and Thatte, S.M.

A  Restructurable  Integrated  Circuit  for  Implementing  Programmable  Digital 
Systems. 

IEEE  Computer  15(3):43-54,  March,  1982. 

The  RIC  chip  contains  four  16-bit  processor  slices  which  can  be  connected  in 
various  ways.  One  possibility  is  to  use  two  of  the  processors  in  lockstep  to  form 
a  32-bit  application  processor  while  the  other  two  processor  slices  are  used  for 
operating  system  and  I/O  processing. 

Burkhardt,  W.H.  and  Randel,  R.C. 

Design of Operating Systems with Micro-Programmed Implementation.

Technical  Report  PIT-CS-BU-73-01,  Univ.  of  Pittsburgh,  Computer  Science  Dept., 
September,  1973. 

Also available as NTIS Report PB-224-484.

Buzen,  J.P.  and  Gagliardi,  U.O. 

The  Evolution  of  Virtual  Machine  Architecture. 

In  Proc.  of  National  Computer  Conf.,  pages  291-299.  AFIPS,  June,  1973. 

Buzen  and  Gagliardi  survey  the  hardware  and  software  methods  which  have  been 
employed  to  support  virtual  machines  on  existing  "third  generation" 
architectures. 


[Case.R  78]  Case,  R.P.  and  Padegs,  A. 

Architecture  of  the  IBM  System/370. 

Communications  of  the  ACM  21(1):73-96,  January,  1978. 

Reprinted  in  Siewiorek  et  al.,  Computer  Structures:  Principles  and  Examples, 
McGraw-Hill,  1982,  pp.  830-855. 

[Cheriton.D  79]  Cheriton,  D.R.,  Malcolm,  M.A.,  Melen,  L.S.,  and  Sager,  G.R. 

Thoth,  a  Portable  Real-Time  Operating  System. 

Communications of the ACM 22(2):105-115, February, 1979.

The  primary  concepts  and  facilities  of  the  Thoth  real  time  operating  system  are 
described.  Process  structuring  of  programs  is  emphasized.  The  communication 
mechanism  only  provides  for  synchronous  sends  of  messages  to  the  single 
mailbox  associated  with  a  receiving  process. 

[Clark.D  80]  Clark,  D.W.  and  Strecker,  W.D. 

Comments on "The Case for the Reduced Instruction Set Computer," by Patterson
and  Ditzel. 

Computer Architecture News 8(6):34-38, October, 1980.

Clark  and  Strecker  respond  to  the  paper  by  Patterson  and  Ditzel,  pointing  out  a 
number  of  weaknesses  in  the  arguments  which  they  gave  in  favor  of  reduced 
instruction  set  computers.  Clark  and  Strecker  believe  that  it  will  be  very  difficult 
to compare RISC and CISC architectures without actually building a
complete  RISC  system,  including  the  operating  system,  and  evaluating  it  over  a 
wide  spectrum  of  real  applications. 

[Colwell.R 83] Colwell, R.P., Hitchcock, C.Y., III, and Jensen, E.D.

Peering  Through  the  RISC/CISC  Fog:  An  Outline  of  Research. 

Computer Architecture News 11(1):44-50, March, 1983.

The  authors  propose  two  studies  designed  to  shed  more  light  on  the  current 
RISC/CISC  debate.  First  they  want  to  separate  out  the  performance 
degradation  caused  by  object  orientation  overhead,  from  degradation  caused 
by  complexity  of  the  instruction  set  itself  in  machines  such  as  the  iAPX  432.  The 
second  study  is  to  separate  the  performance  gains  due  to  multiple  register  set 
techniques,  from  those  resulting  from  the  reduced  complexity  of  the  instruction 
set  itself  in  RISC  machines. 

[Copeland.G  82]  Copeland,  G.P. 

What  If  Mass  Storage  Were  Free? 

IEEE  Computer  15(7):27-35,  July,  1982. 

Copeland  investigates  the  possible  advantages  of  a  nondeletion  strategy  for  a  mass 
storage  system,  including  increased  functionality  through  access  to  past  states, 
and  improved  system  performance  through  avoidance  of  garbage  collection, 
reduced  need  for  checkpoints,  and  reduced  need  for  locking. 

[Cragon.H  82]  Cragon,  H.G. 

The  Myth  of  the  Hardware  /  Software  Cost  Ratio. 

IEEE Computer 15(12):100-101, December, 1982.

Cragon questions the "folk law" which states that today software costs are two to
four  times  the  cost  of  hardware.  He  cites  a  number  of  studies  in  supporting  his 
contention  that  the  cost  of  software  is  high,  but  less  than  the  cost  of  hardware. 


[Dahlby.S 78] Dahlby, S.H., Henry, G.G., Reynolds, D.N., and Taylor, P.T.

System/38: A High Level Machine.

In IBM System/38: Technical Developments, pages 47-50. IBM GS80-0237, 1978.

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill, 1982, pp. 533-536.

[Dannenbe.R 79] Dannenberg, R.B.

An Architecture with Many Operand Registers to Efficiently Execute
Block-Structured Languages.

In Proc. of 6th Annual Symp. on Computer Architecture, pages 50-57. IEEE and
ACM, April, 1979.

Dannenberg discusses a number of techniques for using many registers to hold the
variables of a program. However, there is no discussion of the problems such
large numbers of registers cause for process switching.

[DeBruijn.N 67] DeBruijn, N.G.

Additional Comments on a Problem in Concurrent Programming Control.

Communications of the ACM 10(3):137-138, March, 1967.

DeBruijn modifies Knuth's solution to the critical section mutual exclusion problem
so that an individual process is guaranteed access to its critical section within
N(N-1)/2 turns.

[DeMartin.M 76] DeMartinis, M., Lipovski, G.J., Su, S.Y.W., and Watson, J.K.

A Self Managing Secondary Memory System.

In Proc. of 3rd Annual Symp. on Computer Architecture, pages 186-194. IEEE and
ACM, January, 1976.

The authors show how, by adding associative hardware to serial memory devices,
the file system can become self managing in that no directories need be kept
and garbage collection and storage allocation can be provided automatically.

[Denning.P 68] Denning, P.J.

The Working Set Model for Program Behavior.

Communications of the ACM 11(5):323-333, May, 1968.

This is the first paper discussing the working set model for memory management.
The working set of a process is the set of pages referenced by that process in a
given "window" of virtual time. A process will not be allowed to execute unless
all of its working set can fit in main memory. In this way the load on the
processor is automatically controlled, and thrashing is avoided. Denning
discusses two implementations of the working set model, first assuming only
that a "use bit" is associated with each page in memory, and second assuming
that a timer is associated with each page.

[Denning.P 80a] Denning, P.J.

Why  Not  Innovations  in  Computer  Architecture? 

Computer  Architecture  News  8(2):4-7,  April,  1980. 

Denning  laments  the  fact  that  proven  techniques  such  as  virtual  storage 
management,  among  others,  are  not  (properly)  incorporated  in  most 
commercial  architectures,  in  spite  of  convincing  demonstrations  of  their  value. 
[Denning.P 80b] Denning, P.J. and Dennis, T.D.

On  Minimizing  Contention  at  Semaphores. 

Computer  Architecture  News  8(2):12-19,  April,  1980. 

The  use  of  tagged  memory  and  microprogrammed  operations  are  explored  as 
means  of  keeping  the  holding  times  of  semaphores  and  the  process  ready  list 
manipulations  to  a  minimum  in  multiprocessor  systems. 


[Denning.P 82] Denning, P.J.

Are Operating Systems Obsolete?

Communications of the ACM 25(4):225-227, April, 1982.

Denning argues that the principal concepts of operating systems can be grouped
into five broad classes: Process Coordination, Virtual Memory, File System,
Device Independence, and Job Control. Furthermore, these principal concepts
will not become obsolete in the near future.

[Dijkstra.E 65] Dijkstra, E.W.

Solution of a Problem in Concurrent Programming Control.

Communications of the ACM 8(9):569, September, 1965.

Dijkstra shows how mutually exclusive access to the critical section in each of N
concurrent, sequential processes can be ensured, assuming only that
indivisible read and write operations on the primary memory are available.

[Dijkstra.E 68] Dijkstra, E.W.

The Structure of the "THE" - Multiprogramming System.

Communications of the ACM 11(5):341-346, May, 1968.

The THE operating system was structured as a hierarchy of nested abstract
machines. The hierarchy was implemented as a series of layers of software,
each extending the instruction set of the machines below it, and hiding the
details of its internal structure from the levels above.

[Ditzel.D 80a] Ditzel, D.R. and Patterson, D.A.

Retrospective on High-Level Language Computer Architecture.

In Proc. of 7th Annual Symp. on Computer Architecture, pages 97-104. IEEE and
ACM, May, 1980.

Ditzel and Patterson argue for paying more attention to developing a High Level
Language Computer System (HLLCS), rather than just a high level language
computer. They list the attributes of a HLLCS, one of which is support for
operating systems.

[Ditzel.D 80b] Ditzel, D.R. and Kwinn, W.A.

Reflections  on  a  High  Level  Language  Computer  System  or  Parting  Thoughts  on 
the  SYMBOL  Project. 

In  Proc.  of  Int.  Workshop  on  High-Level  Language  Computer  Architecture,  pages 
80-87. Dept. of Computer Science, Univ. of Maryland, May, 1980.

Ditzel  and  Kwinn  comment  on  various  aspects  of  the  SYMBOL  System.  The 
hardware  implemented  operating  system  was  very  successful  from  a 
performance  and  programming  standpoint,  but  while  software  costs  were 
reduced,  overall  costs  were  not. 



[Ditzel.D 82] Ditzel, D.R. and McLellan, H.R.

Register Allocation for Free: The C Machine Stack Cache.

In Proc. of Symp. on Architectural Support for Programming Languages and
Operating Systems, pages 48-56. ACM, March, 1982.

The stack cache mechanism improves the speed of subroutine calls and access to
most operands. Unfortunately, process switching time is increased since the
entire cache register file must be saved and restored.

[Eads.W 82] Eads, W.O., Walden, J.M., and Miller, E.L.

A Dual-Processor Desk Top Computer: The HP 9845A.

In Siewiorek, D.P., Bell, C.G., and Newell, A., editors, Computer Structures:
Principles and Examples, pages 508-532. McGraw-Hill, 1982.

The HP 9845A contains two main processors, a Language Processing Unit for
interpreting BASIC programs, and a Peripheral Processing Unit for handling
I/O and most management functions normally associated with an operating
system.

[Eisenber.M 72] Eisenberg, M.A. and McGuire, M.R.

Further Comments on Dijkstra's Concurrent Programming Control Problem.

Communications of the ACM 15(11):999, November, 1972.

Eisenberg and McGuire improve upon DeBruijn's and Knuth's solutions to the
critical section mutual exclusion problem so that an individual process is
guaranteed access to its critical section within N-1 turns.

[Erwin.J 70] Erwin, J.D. and Jensen, E.D.

Interrupt Processing with Queued Content-Addressable Memories.

In Proc. of Fall Joint Computer Conf., pages 621-627. AFIPS, November, 1970.

Erwin and Jensen describe the design of a special purpose Interrupt Processor (IP)
which incorporates all of the functions associated with detecting,
acknowledging, and scheduling interrupts on a priority basis. The IP is
organized around a special unit called a queued content-addressable memory,
which forms its primary storage and processing facility.

[Fabry.R 74] Fabry, R.S.

Capability Based Addressing.

Communications of the ACM 17(7):403-412, July, 1974.

Fabry provides a good overview of capability based addressing and protection
mechanisms, their motivation, and their implementation.

[Fancott.T 77] Fancott, T. and Probst, W.G.

Software  Distribution  in  a  Microcomputer-Based  Multiprocessor. 

In Proc. of 6th Texas Conf. on Computing Systems, pages 4B.28-4B.34. IEEE and
ACM,  November,  1977. 

Fancott  and  Probst  suggest  that  OS  modules  could  eventually  be  provided  as 
standard  chips  and  then  interconnected  with  some  form  of  bus.  Message 
communication  would  be  used  among  the  functional  modules.  Suggested 
modules  are  device  service  routines,  file  management  package,  task  scheduler, 
resource  allocator,  remote  communications  controller,  and  general  processors 
for  user  tasks.  Prior  to  the  availability  of  such  standard  chips,  microprocessors 
could  be  used. 
[Farber.D 72] Farber, D.J., and Larson, K.C.

The Structure of a Distributed Computing System - Software.

In  Proc.  of  Symp.  on  Computer-Communications  Networks  and  Teletraffic,  pages 
539-545.  Polytechnic  Press,  April,  1972. 

[Flynn.M  72]  Flynn,  M.J.  and  Podvin,  A. 

Shared  Resource  Multiprocessing. 

IEEE  Computer  5(2):20-28,  March/April,  1972. 

Flynn  and  Podvin  propose  an  extension  to  the  hardware  timeshared  ALU  approach 
as  found  in  the  peripheral  processors  of  the  CDC  6600.  32  skeleton  processors, 
divided  into  4  rings  of  8  processors  each,  share  multiple,  high  performance, 
pipelined  execution  units.  A  maximum  performance  of  500  MIPS  is  claimed  to 
be  possible. 

[Ford.W  76]  Ford,  W.S.  and  Hamacher,  V.C. 

Hardware  Support  for  Inter-Process  Communication  and  Processor  Sharing. 

In  Proc.  of  3rd  Annual  Symp.  on  Computer  Architecture,  pages  113-118.  IEEE  and 
ACM,  January,  1976. 

Ford and Hamacher describe the hardware implementation of a simple single-word
mailbox  communication  mechanism  which  can  be  used  as  a  basis  for  more 
complex  communication.  All  I/O  is  done  through  the  mailbox  mechanism.  A 
hardware  priority  dispatcher,  which  can  overlap  with  normal  processing, 
provides  fast  process  switches  by  having  a  separate  register  set  for  each 
process. 

[Fraim.L  83]  Fraim,  L.J. 

Scomp:  A  Solution  to  the  Multilevel  Security  Problem. 

IEEE  Computer  16(7):26-34,  July,  1983. 

The  Honeywell  Secure  Communications  Processor  (Scomp)  is  a  commercially 
available  minicomputer  system  supporting  a  multilevel  security  policy.  The 
system  is  based  on  a  security  kernel,  with  special  hardware,  called  the  Security 
Protection  Module,  added  to  enhance  the  performance  of  the  reference 
mediation  operations. 

[Freeman.M 78] Freeman, M., Jacobs, W.W., and Levy, L.S.

Perseus:  An  Operating  System  Machine. 

In Proc. of 3rd USA-Japan Computer Conf., pages 430-435. AFIPS and IPSJ,
October,  1978. 

Perseus  consists  of  three  main  modules.  The  Supervisor  receives  user  requests 
and  sequences  the  actions  to  be  performed.  The  Interface,  which  consists  of  a 
memory manager, resource manager, dispatcher, action processor(s), and I/O
control, does resource allocation and carries out actions. The Policy Module
monitors system performance and adjusts parameters and procedures to meet
varying  system  loads. 


[Gerrity.G 81] Gerrity, G.W.

On  Processes  and  Interrupts. 

Computer  Architecture  News  9(4):4-14,  June,  1981. 

Gerrity  discusses  hardware  support  for  process  queueing  and  scheduling.  WAKE- 
UP  and  SLEEP  operations  are  used  for  synchronization  and  communication. 
Interrupts  are  provided  as  WAKE-UP  signals.  Process  switching  is  handled 
automatically and is aided by the use of "sticky bits", so that only modified
registers  are  saved,  and  "undefined  bits",  so  that  only  registers  which  are  used 
are  loaded. 

[Gifford.D  77]  Gifford,  D.K. 

Hardware  Estimation  of  a  Process'  Primary  Memory  Requirements. 

Communications of the ACM 20(9):655-663, September, 1977.

In  the  Honeywell  6180  processor  supporting  Multics,  an  associative  table  keeps  the 
16  most  recently  used  page  names  in  LRU  order.  By  keeping  track  of  the  miss 
rate  of  this  associative  memory  it  is  possible  to  estimate  the  working  set  size  of 
a  process,  since  the  two  should  be  proportional. 
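
As a rough illustration of the idea in this annotation (not Gifford's hardware or his actual estimator), the sketch below simulates a 16-entry table of page names kept in LRU order and reports its miss rate over a reference string. The function name `lru_miss_rate` and the use of a plain list of page names as input are assumptions made for this example.

```python
from collections import OrderedDict


def lru_miss_rate(page_refs, table_size=16):
    """Return the fraction of references that miss a small LRU table.

    page_refs  -- sequence of page names (any hashable values); hypothetical input
    table_size -- number of entries in the associative table (16 in the annotation)
    """
    table = OrderedDict()  # keys are page names, ordered least to most recently used
    misses = 0
    for page in page_refs:
        if page in table:
            # Hit: mark the page as most recently used.
            table.move_to_end(page)
        else:
            # Miss: insert the page and evict the least recently used entry if full.
            misses += 1
            table[page] = True
            if len(table) > table_size:
                table.popitem(last=False)
    return misses / len(page_refs) if page_refs else 0.0
```

A process that cycles through 8 pages misses only on its first pass, while one that cycles through 32 pages misses on every reference, so under these assumptions the observed miss rate rises with the size of the set of pages in active use.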

[Giloi.W 81] Giloi, W.K. and Behr, P.

An  IPC  Protocol  and  its  Hardware  Realization  for  a  High-Speed  Distributed 
Multicomputer  System. 

In Proc. of 8th Annual Symp. on Computer Architecture, pages 481-493. IEEE and
ACM,  May,  1981. 

Each  node  in  the  system  has  a  separate  Cooperation  Handler  processor  for 
supporting  message  communication  according  to  a  producer  and  consumer 
type  of  protocol.  The  Cooperation  Handler  also  provides  some  protection  by 
controlling  access  to  local  objects  from  remote  nodes.  This,  in  cooperation  with 
the  Address  Transformation  and  Memory  Guard  Unit  provides  protected, 
capability  based  access  to  the  local  memory  of  a  node. 

[Goldberg.R  73]  Goldberg,  R.P. 

Architecture  of  Virtual  Machines. 

In  Proc.  of  National  Computer  Conf.,  pages  309-318.  AFIPS,  June,  1973. 

Goldberg  presents  a  model  of  recursive  virtual  machines  as  a  compound  mapping 
of  process  names  into  resource  names,  and  virtual  resource  names  into  real 
resource names. He proposes a "hardware virtualizer" as the natural
implementation  of  this  model  and  suggests  that  a  virtual  machine  with  this 
support  should  enjoy  performance  comparable  to  the  real  machine. 

[Goldberg.R  74]  Goldberg,  R.P. 

A  Survey  of  Virtual  Machine  Research. 

IEEE  Computer  7(6):34-45,  June,  1974. 

Goldberg  surveys  a  variety  of  new  architectures  which  are  specifically  designed  to 
support  virtual  machines. 
[Goldstei.B 75] Goldstein, B.C. and Scrutchin, T.W.

A  Machine-Oriented  Resource  Management  Architecture. 

In Proc. of 2nd Annual Symp. on Computer Architecture, pages 214-219. IEEE and
ACM,  January,  1975. 

An  APL-like  machine  is  described  in  which  hardware  supported  locks  or  arbitrary 
software  management  functions  can  be  associated  with  any  object  to  control  its 
use.  The  control  function  is  invoked  automatically  whenever  the  object  is 
referenced. 

[Guillier.P  80]  Guillier,  P.  and  Slosberg,  D. 

An Architecture with Comprehensive Facilities of Inter-Process Synchronization
and  Communication. 

In Proc. of 7th Annual Symp. on Computer Architecture, pages 264-270. IEEE and
ACM,  May,  1980. 

The  hardware  support  for  processes,  semaphores,  and  messages  in  the  Honeywell 
Series  60  Level  64  is  described  in  some  detail.  The  Level  64  provides  automatic 
queueing  and  priority  dispatching  of  processes.  Semaphores  with  and  without 
messages  are  supported  and  all  I/O  is  handled  through  such  semaphores. 

[Halstead.R  80]  Halstead,  R.H.,Jr.,  and  Ward,  S.A. 

The  MuNet:  A  Scalable  Decentralized  Architecture  for  Parallel  Computation. 

In Proc. of 7th Annual Symp. on Computer Architecture, pages 139-145. IEEE and
ACM,  May,  1980. 

[Hatch.T  68]  Hatch,  T.F.,  Jr.  and  Geyer,  J.B. 

Hardware  /  Software  Interaction  on  the  Honeywell  Model  8200. 

In Proc. of Fall Joint Computer Conf., pages 891-901. AFIPS, December, 1968.

The  Model  8200  features  hardware  controlled  "horizontal  multiprogramming" 
whereby  single  instructions  from  each  of  up  to  8  programs  plus  one  master 
program  (operating  system)  are  executed  in  round  robin  sequence.  The  master 
program  is  given  special  privileges,  including  the  ability  to  block  execution  of 
the  other  programs.  A  memory  and  peripheral  device  protection  scheme  based 
on  locks  and  keys  is  supported. 

[Hennessy.J  81]  Hennessy,  J.,  Jouppi,  N.,  Baskett,  F.,  and  Gill,  J. 

MIPS:  A  VLSI  Processor  Architecture. 

In Kung, H.T., Sproull, B., and Steele, G., editor, VLSI Systems and Computations,
pages 337-346. Computer Science Press, 1981.

MIPS  (Microprocessor  without  Interlocked  Pipe  Stages)  is  a  high  performance, 
reduced  instruction  set  machine.  The  instruction  set  is  essentially  a  compiler- 
driven  encoding  of  the  micromachine,  so  that  little  or  no  decoding  is  needed 
and  the  instructions  correspond  closely  to  microcode  instructions.  The 
processor  is  pipelined  but  provides  no  interlocks  in  hardware,  relying  instead 
on  the  compiler  to  arrange  the  code  appropriately,  inserting  NO-OPs  where 
necessary. 


[Hennessy.J 82] Hennessy, J., Jouppi, N., Baskett, F., Gross, T., and Gill, J.

Hardware / Software Tradeoffs for Increased Performance.

In Proc. of Symp. on Architectural Support for Programming Languages and
Operating Systems, pages 2-11. ACM, March, 1982.

The authors argue that the most effective system design methodology must make
simultaneous tradeoffs across all three areas of hardware, software support,
and systems support. The MIPS machine is used as an example.

[Hobson.R 81] Hobson, R.F.

Structured Machine Design: An Ongoing Experiment.

In Proc. of 8th Annual Symp. on Computer Architecture, pages 37-55. IEEE and
ACM, May, 1981.

The Structured Architecture Machine is a single user high level language computer
system. It contains a separate Environment Control Unit which provides the
traditional operating system functions such as task initiation, user command
interpretation, peripheral communication, and so on.

[Hoffman.R 78] Hoffman, R.L. and Soltis, F.G.

Hardware Organization of the System/38.

In IBM System/38: Technical Developments, pages 19-21. IBM GS80-0237, 1978.

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill, 1982, pp. 544-546.

[Horton.F 74] Horton, F.R., Wagler, D.W., and Tallman, P.H.

Virtual Machine Assist: Performance and Architecture.

Technical Report TR 75.0006, IBM, April, 1974.

VM Assist is a set of microprograms for handling supervisor calls and 11 privileged
instructions which were previously handled by VM/370 software. It provides a
75 percent reduction in supervisor state seconds and almost a 50 percent
reduction in the elapsed time of batch throughput.

[Houdek.M 81] Houdek, M.E., Soltis, F.G., and Hoffman, R.L.

IBM System/38 Support for Capability-Based Addressing.

In Proc. of 8th Annual Symp. on Computer Architecture, pages 341-348. IEEE and
ACM, May, 1981.

The single level object store and capability-based addressing of the System/38 is
described in some detail.

[Ichbiah.J 79] Ichbiah, J.D., Barnes, J.G.P., Heliard, J.C., Krieg-Brueckner, B.,
Roubine, O., and Wichmann, B.A.

Preliminary Ada Reference Manual and Rationale for the Design of the Ada
Programming Language.

SIGPLAN Notices 14(6), June, 1979.

The original, preliminary definition of the Ada programming language, and an
explanation of why some of its features were designed the way they were.
ICOT.

Outline of Research and Development Plans for Fifth Generation Computer
Systems.

Technical Report, Institute for New Generation Computer Technology, Tokyo,
Japan, May, 1982.

This report presents a good overview of the Japanese Fifth Generation Computer
Systems Project. It contains slightly more detail than the paper by Treleaven
and Lima dealing with the same topic.

[Intel 81a] Intel Corp.

iAPX 432 General Data Processor Architecture Reference Manual.

Intel Corp., Santa Clara, CA, 1981.

The Intel iAPX 432 microprocessor could be characterized as an "operating system
machine". It contains a powerful set of mechanisms in the areas of storage
management, process scheduling, and interprocess communication. The 432
supports object-oriented systems.

[Intel 81b] Intel Corp.

iAPX 432 Interface Processor Architecture Reference Manual.

Intel Corp., Santa Clara, CA, 1981.

The Intel iAPX 432 Interface Processor serves as an I/O channel. It extends the
object and protection model of the 432 to the external interface, allowing
processes to deal with external devices as objects. It also controls the access to
main memory by external devices, enforcing the protection system.

[Intel 82] Intel Corp.

Software on Silicon: The iAPX 86/30 and 88/30.

Innovator 3(2):1-2, Winter, 1982.

A special chip, the 80130, contains the code for many basic operating system
functions, and provides faster access than standard memory.

[Ishikawa.C 81] Ishikawa, C., Sakamura, K., and Maekawa, M.

Adaptation and Personalization of VLSI-Based Computer Architecture.

In Proc. of Micro 14: 14th Annual Workshop on Microprogramming, pages 51-61.
IEEE and ACM, October, 1981.

The authors discuss the advantages of monitoring and adapting a system to
improve its performance. The primary adaptation technique is migration of
frequently used, expensive functions into microcode. It is suggested that
eventually the adaptation process will be done automatically.

[Jackson.P 83] Jackson, P.

Unix Variant Opens a Path to Managing Multiprocessor Systems.

Electronics 56(15):118-124, July, 1983.

The Convergent Technologies MegaFrame is a multiprocessor system which
supports the Unix operating system. A set of Motorola 68010 based application
processors handles all application related tasks including process and memory
management. A separate set of Intel iAPX-186 based processors takes care of
all file management, and another set of iAPX-186 based processors handles
communications with peripheral devices.
[Jagannat.A 80] Jagannathan, A.

A  Technique  for  the  Architectural  Implementation  of  Software  Subsystems. 

In  Proc.  of  7th  Annual  Symp.  on  Computer  Architecture,  pages  236-244.  IEEE  and 
ACM,  May,  1980. 

Jagannathan  shows  how  an  operating  system  can  be  modelled  using  a  “type 
extension”  methodology.  He  argues  that  a  given  type  need  not  be  restricted  to 
implementation  in  hardware  or  microcode  or  software  simply  on  the  basis  of  its 
position  in  the  type  hierarchy. 

[Jenevein.R  81]  Jenevein,  R.,  Degroot,  D.,  and  Lipovski,  G.J. 

A  Hardware  Support  Mechanism  for  Scheduling  Resources  in  a  Parallel  Machine 
Environment. 

In  Proc.  of  8th  Annual  Symp.  on  Computer  Architecture,  pages  57-65.  IEEE  and 
ACM,  May,  1981. 

The  authors  discuss  the  hardware  implementation  of  a  simple  scheduling  algorithm 
for  finding  a  processor  that  is  "close"  to  a  reference  processor  in  a  tree  or 
SW-banyan  interconnection  network. 

[Jensen.E 76] Jensen, E.D.

Distributed  Processing  in  a  Real-Time  Environment. 

In  Distributed  Systems:  Infotech  State  of  the  Art  Report,  pages  303-318.  Infotech, 
1976. 

Jensen  describes  the  hardware  support  for  message  communication  provided  by 
the  Modular  Computer  System,  a  forerunner  of  HXDP.  In  MCS  the  Global  Bus 
Interface  associated  with  each  application  processor  provides  hardware 
support  for  message  queueing  on  output  and  receipt.  It  also  handles  most  of 
the  errors  encountered  in  message  communication. 

[Jensen.E 78] Jensen, E.D.

The  Honeywell  Experimental  Distributed  Processor  -  An  Overview. 

IEEE  Computer  11(1):28-38,  January,  1978. 

In  HXDP  each  application  processor  has  an  associated  Bus  Interface  Unit.  The 
BIUs  provide  extensive  support  for  message  communication,  especially  in 
terms  of  error  handling  within  the  bus  based  communication  system.  Eight 
symbolic  message  destinations,  each  associated  with  an  application  process, 
are  recognized  by  each  BIU. 

[Jensen.E 80] Jensen, E.D.

Distributed  Computer  Systems. 

In  Burks,  S.,  editor,  Computer  Science  Research  Review,  1979-1980,  pages  53-63. 
Carnegie-Mellon University, Computer Science Department, 1980.

Jensen  discusses  distributed  computer  systems,  and  in  particular  his  model  of 
decentralized  resource  management  and  control,  and  hardware  /  software 
relationships.  He  briefly  outlines  the  Archons  distributed  computer  system 
research  project. 

[Jensen.E 81a] Jensen, E.D.

Distributed  Control. 

In Lampson, B.W., Paul, M., and Siegert, H.J., editor, Distributed Systems
- Architecture and Implementation, pages 175-190. Springer-Verlag, 1981.


[Jensen.E  81b]  Jensen,  E.D. 

Hardware  /  Software  Relationships  in  Distributed  Computer  Systems. 

In Lampson, B.W., Paul, M., and Siegert, H.J., editor, Distributed Systems
- Architecture and Implementation, pages 413-420. Springer-Verlag, 1981.

Jensen  stresses  the  essential  independence  of  two  system  design  decisions  which 
are  often  confused:  layering  and  hardware  versus  software  implementation. 
Layering  involves  deciding  what  functionality  is  performed  at  what  layers  in  the 
system.  Within  each  layer  it  is  then  necessary  to  decide  how  best  to  implement 
the  functions  of  that  layer. 

[Jensen.K 74] Jensen, K. and Wirth, N.

Pascal  User  Manual  and  Report,  2nd  ed. 

Springer-Verlag,  1974. 

[Johnsson.R  82]  Johnsson,  R.K.  and  Wick,  J.D. 

An  Overview  of  the  Mesa  Processor  Architecture. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  20-29.  ACM,  March,  1982. 

Johnsson  and  Wick  outline  the  main  features  of  the  Mesa  processor.  It  supports 
monitors  and  condition  variables,  and  provides  event  driven,  rather  than  time 
sliced  scheduling.  All  interrupts,  exceptions,  and  communication  with  I/O 
devices  use  the  process  mechanism  and  condition  variables. 

[Jones.A  79]  Jones,  A.K.,  Chansler,  R.J.,  Durham,  I.,  Schwans,  K.,  and  Vegdahl,  S.R. 

StarOS,  a  Multiprocessor  Operating  System  for  the  Support  of  Task  Forces. 

In  Proc.  of  7th  Symp.  on  Operating  Systems  Principles,  pages  117-127.  ACM, 
December,  1979. 

The  microprogrammable  Kmap  processors  of  Cm*  are  used  to  support  capability- 
based  addressing  of  objects  and  a  message  communication  mechanism. 
Operating  system  functions  can  be  readily  migrated  to  microcode  since  the 
Kmap  is  given  "first  refusal"  on  all  system  calls. 

[Jones.A  82]  Jones,  A.K. 

Private  Communication,  September  1982. 

Jones  indicated  that  there  are  a  number  of  rules  of  thumb  regarding  operating 
system  performance  that  circulate  in  the  research  community.  All  general 
purpose  systems  spend  30  to  50  percent  of  their  time  in  the  operating  system. 
Operating  system  kernel  entry  and  exit  cost  is  2  milliseconds,  independent  of 
the  speed  of  the  underlying  machine.  An  I/O  operation  cannot  be  initiated  in 
less  than  10  milliseconds. 

[Kamibaya.N  82]  Kamibayashi,  N.,  Ogawana,  H.,  Nagayama,  K.,  and  Aiso,  H. 

Heart:  An  Operating  System  Nucleus  Machine  Implemented  By  Firmware. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  195-204.  ACM,  March,  1982. 

Heart  is  an  experiment  to  investigate  the  implementation  of  operating  system  kernel 
functions  in  microcode.  It  is  expected  that  different  virtual  machines  and 
operating  systems  could  then  be  built  on  top  of  the  universal  and  highly 
efficient  primitives  provided  by  Heart. 


[Katsuki.D 78] Katsuki, D., Elsam, E.S., Mann, W.F., Roberts, E.S., Robinson, J.G.,
Skowronski, F.S., and Wolf, E.W.

Pluribus: An Operational Fault-Tolerant Multiprocessor.

Proc. of the IEEE 66(10):1146-1159, October, 1978.

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill, 1982, pp. 371-386.

[Kavi.K 82] Kavi, K., Belkhouche, B., Bullard, E., Delcambre, L., and Nemecek, S.

HLL Architectures: Pitfalls and Predilections.

In Proc. of 9th Annual Symp. on Computer Architecture, pages 18-23. IEEE and
ACM, April, 1982.

The authors discuss some of the "myths" surrounding support for high-level
languages. They suggest that support for operating systems and I/O is also
very important since a machine can spend over half of its time executing
operating system routines.

[Knowlton.K 65] Knowlton, K.C.

A Fast Storage Allocator.

Communications of the ACM 8(10):623-625, October, 1965.

This is the first description of the buddy system memory allocation algorithm.

[Knuth.D 66] Knuth, D.E.

Additional Comments on a Problem in Concurrent Programming Control.

Communications of the ACM 9(5):321-322, May, 1966.

Knuth points out that Dijkstra's solution to the critical section mutual exclusion
problem can lead to starvation of individual processes. He provides a
modification which guarantees access to the critical section by an individual
process within 2^(N-1) - 1 turns. He also points out that if indivisible queue
manipulation operations were provided by the hardware, the solution would be
much simpler and more efficient.

[Koplin.M 76] Koplin, M.R.

M138/M148 Performance Summary.

GUIDE Presentation, July 1976.

[Lamport.L 74] Lamport, L.

A New Solution of Dijkstra's Concurrent Programming Problem.

Communications of the ACM 17(8):453-455, August, 1974.

Lamport provides a new, simple solution to the critical section mutual exclusion
problem which is more robust than previous solutions in that the system can
continue to operate despite the failure of any individual component.

[Lampson.B 68] Lampson, B.W.

A Scheduling Philosophy for Multiprocessing Systems.

Communications of the ACM 11(5):347-360, May, 1968.

Lampson discusses processor scheduling and points out that a single hardware
scheduler could be used in place of the interrupt system and software
scheduler of usual systems. A well parameterized scheduler could carry out its
functions quickly without being unduly restrictive, i.e. scheduling policies could
be changed.
[Lampson.B 80] Lampson, B.W. and Pier, K.A.

A  Processor  for  a  High-Performance  Personal  Computer. 

In  Proc.  of  7th  Annual  Symp.  on  Computer  Architecture,  pages  146-160.  IEEE  and 
ACM,  May,  1980. 

The  Dorado  processor  is  capable  of  switching  processes  on  every  machine  cycle.  It 
uses  a  separate  register  set  for  each  process  in  order  to  accomplish  this.  16 
tasks  are  supported,  arranged  in  priority  order,  with  task  switching  performed 
automatically  in  response  to  interrupts  for  higher  priority  tasks. 

[Lampson.B  82a]  Lampson,  B.W. 

Fast  Procedure  Calls. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  66-76.  ACM,  March,  1982. 

Lampson  argues  that  a  processor's  control  transfer  mechanism  should  handle  a 
variety  of  applications  such  as  procedure  calls  and  returns,  coroutine  transfers, 
exceptions,  and  process  switches  in  a  uniform  way.  Furthermore,  it  should  be 
very  efficient  for  the  common  case  of  procedure  call  and  return.  The  Mesa 
Processor’s  XFER  primitive  is  based  on  the  control  transfer  model  presented  in 
this  paper. 

[Lampson.B  82b]  Lampson,  B.W. 

Private  Communication,  August  1982. 

In  summarizing  the  lessons  learned  from  the  BCC  500  Lampson  stated,  "The 
bottom  line  is  that  specialized  processors  are  a  fine  idea,  but  won’t  make  up  for 
insufficient speed of the general processor, or for insufficient memory."

[Landwehr.C  83]  Landwehr,  C.E. 

The  Best  Available  Technologies  for  Computer  Security. 

IEEE Computer 16(7):86-100, July, 1983.

Landwehr provides a good, concise overview of the work that has been done and is
in  progress  in  developing  secure  computer  systems. 

[Lee.W 74] Lee, W.K.

The Memory Management Function in a Multiprocessor Computer System - A
Description  of  the  BCC  500  Memory  Manager. 

Technical Report R-2, The Aloha System, Task II, Dept. of Electrical Engineering,
Univ.  of  Hawaii,  September,  1974. 

Lee  describes  the  BCC  500  Memory  Management  Processor  in  great  detail.  The 
memory  manager  is  continuously  active,  monitoring  the  memory  system  and 
taking  appropriate  action  more  quickly  and  more  often  than  is  possible  when 
time-sharing  these  functions  on  a  CPU  with  other  system  and  user  tasks. 

[Liskov.B  72]  Liskov,  B.H. 

The  Design  of  the  Venus  Operating  System. 

Communications of the ACM 15(3):144-149, March, 1972.

The  Venus  operating  system  consists  of  a  combination  of  microcode  and  software. 
The  microcode  supports  16  virtual  machines  with  priority  scheduling. 
Semaphores  are  used  for  synchronization,  communication,  and  I/O  completion 
signalling. 


[Maekawa.M  79]  Maekawa,  M.,  Yamazaki,  I.,  Tanaka,  A.,  Nakamura,  A.,  and  Ishida,  K. 

Experimental Polyprocessor System (EPOS) - Operating System.

In Proc. of 6th Annual Symp. on Computer Architecture, pages 196-201. IEEE and
ACM,  April,  1979. 

EPOS  is  a  functionally  partitioned,  multiprocessor  system.  Most  of  the  operating 
system  is  implemented  in  microcode  and  the  various  functions  can  be 
reassigned  dynamically  to  different  processors. 

[Maekawa.M  82]  Maekawa,  M.,  Sakamura,  K.,  and  Ishikawa,  C. 

Firmware  Structure  and  Architectural  Support  for  Monitors,  Vertical  Migration  and 
User  Microprogramming. 

In Proc. of Symp. on Architectural Support for Programming Languages and
Operating  Systems,  pages  185-194.  ACM,  March,  1982. 

The  microcode  structure  of  a  system  (EPOS)  is  discussed  in  which  the  operating 
system  kernel  is  implemented  in  microcode  as  a  set  of  monitors.  User 
microcode  is  supported  by  having  a  master  (privileged)  mode  for  system  code 
and  a  slave  mode  for  user  code  in  the  micromachine.  The  language  interpreters 
are  slave  mode  microcode. 

[McCarthy.J 62] McCarthy, J., Abrahams, P.W., Edwards, D.J., Hart, T.P., and Levin, M.I.

Lisp  1.5  Programmer's  Manual. 

MIT  Press,  1962. 

[McGehear.P  80]  McGehearty,  P.F. 

Performance  Evaluation  of  a  Multiprocessor  Under  Interactive  Workloads. 

PhD thesis, Carnegie-Mellon University, Computer Science Department, August,
1980.

McGehearty  measures  and  evaluates  the  performance  of  C.mmp/Hydra  under  a 
variety  of  synthetic,  interactive  workloads.  One  of  the  tools  developed  for  this 
purpose  was  a  Terminal  Emulator,  which  is  a  separate  processor  that  provides 
the  synthetic,  multiuser,  interactive  workloads  based  on  stored  scripts. 

[Metcalfe.R  76]  Metcalfe,  R.M.  and  Boggs,  D.R. 

Ethernet:  Distributed  Packet  Switching  for  Local  Computer  Networks. 

Communications  of  the  ACM  19(7):395-404,  July,  1976. 

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill,  1982,  pp.  429-438. 

[Meyer.R  70]  Meyer,  R.A.  and  Seawright,  L.H. 

A  Virtual  Machine  Time-Sharing  System. 

IBM  Systems  Journal  9(3):199-218, 1970. 

Meyer  and  Seawright  discuss  in  some  detail  the  design  and  operation  of  Control 
Program-67 / Cambridge Monitor System (CP-67/CMS), one of the earliest
virtual  machine  systems,  and  a  forerunner  of  VM/370.  CP-67  ran  on  the 
System/360  Model  67. 

[Mitchell.J  79]  Mitchell,  J.G.,  Maybury,  W.,  and  Sweet,  R. 

Mesa  Language  Manual. 

Technical  Report  CSL  79-3,  Xerox  Palo  Alto  Research  Center,  1979. 

This is the complete definition and reference manual for the Mesa programming
language  system. 


[Morris.J 72] Morris, J.B.

Demand  Paging  Through  Utilization  of  Working  Sets  on  the  MANIAC  II. 

Communications  of  the  ACM  15(10):867-872,  October,  1972. 

Morris describes the design of a virtual memory system for the MANIAC II
computer. Simple timer circuits for measuring elapsed process time since last
access  are  added  to  the  page  frames  of  the  system.  These  timers  permit  the 
cheap,  direct  measurement  of  intrinsic  program  working  sets. 
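
The per-frame timer idea in this annotation can be sketched in a few lines. This is a minimal software model under stated assumptions, not the MANIAC II circuitry: the class name `FrameTimers`, its methods, and the window parameter `tau` are hypothetical names introduced only for this illustration.

```python
class FrameTimers:
    """Toy model of per-page-frame timers (all names here are hypothetical).

    Each resident page has a timer holding the process time elapsed since
    the page was last accessed. The working set for a window tau is the set
    of pages whose timer is still below tau.
    """

    def __init__(self):
        self.idle = {}  # page name -> process time since last access

    def access(self, page):
        # A reference to the page resets its timer.
        self.idle[page] = 0

    def tick(self, dt=1):
        # Advance every timer by dt units of process (not wall-clock) time.
        for page in self.idle:
            self.idle[page] += dt

    def working_set(self, tau):
        # Pages referenced within the last tau units of process time.
        return {page for page, t in self.idle.items() if t < tau}
```

In this model a pager would call `tick` only while the owning process is running, so that the timers measure process time rather than elapsed real time, and would treat pages outside `working_set(tau)` as candidates for replacement.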

[Moto-oka.T  83]  Moto-oka,  T. 

Overview  to  the  Fifth  Generation  Computer  System  Project. 

In  Proc.  of  10th  Annual  Symp.  on  Computer  Architecture,  pages  417-422.  IEEE  and 
ACM,  June,  1983. 

Moto-oka  provides  a  nice,  brief  overview  of  the  Japanese  Fifth  Generation 
Computer  System  Project,  its  goals  and  approaches. 

[Muftic.S  77]  Muftic,  S.  and  Liu,  M.T. 

The  Design  of  a  Secure  Computer  System. 

In  Proc.  of  Symp.  on  Trends  and  Applications  1977:  Computer  Security  and 
Integrity,  pages  64-70.  IEEE  and  NBS,  May,  1977. 

Muftic  and  Liu  describe  the  design  of  a  secure  computer  system  in  which  special 
hardware  devices  for  encoding  and  decoding  data  are  added  to  all  user 
terminals,  and  a  Security  Control  Device  mediates  all  CPU  accesses  to  main 
memory.  All  data  in  main  memory  and  in  files  can  be  stored  in  encoded  form. 

[Myers.G  80a]  Myers,  G.J.  and  Buckingham,  B.R.S. 

A Hardware Implementation of Capability-Based Addressing.

Computer Architecture News 8(6):12-24, October, 1980.

Similar  material  appears  in  Chapter  4  of  Myers,  Advances  in  Computer 
Architecture,  Wiley,  1982. 

[Myers.G  80b]  Myers,  G.J. 

SWARD  -  A  Software-Oriented  Architecture. 

In  Proc.  of  Int.  Workshop  on  High-Level  Language  Computer  Architecture,  pages 
163-168. Dept. of Computer Science, Univ. of Maryland, May, 1980.

A  more  detailed  discussion  of  SWARD  is  contained  in  Myers,  Advances  in 
Computer  Architecture,  Wiley,  1982. 

[Myers.G  82]  Myers,  G.J. 

Advances  in  Computer  Architecture,  Second  Edition. 

John  Wiley  &  Sons,  1982. 

Myers  criticizes  the  conventional  von  Neumann  architecture  for  leaving  a  large 
"semantic  gap"  between  it  and  the  concepts  of  modern  high  level  languages 
and  operating  systems.  He  outlines  various  ways  that  this  gap  can  be  narrowed 
and  discusses  at  great  length  a  number  of  illustrative  systems,  including 
SYMBOL,  SWARD,  and  the  iAPX  432. 


[Namjoo.M 82] Namjoo, M. and McCluskey, E.J.

Watchdog  Processors  and  Capability  Checking. 

In  Proc.  of  12th  Annual  Symp.  on  Fault-Tolerant  Computing,  pages  245-248.  IEEE, 
June,  1982. 

Namjoo  and  McCluskey  describe  the  use  of  a  watchdog  processor  which  monitors 
all  activity  on  the  processor  to  memory  bus.  The  watchdog  processor  contains 
tables  indicating  the  ways  in  which  each  object  is  permitted  to  access  other 
objects.  Any  invalid  accesses  which  are  detected  cause  an  error  signal  to  be 
sent  to  the  CPU. 

[Nelson.B 81] Nelson, B.J.

Remote  Procedure  Call. 

PhD  thesis,  Carnegie-Mellon  University,  Computer  Science  Department,  May,  1981. 

One of Nelson's "performance lessons" in implementing remote procedure calls is
to  use  microcode  for  exceptional  performance.  The  physical  transport  time 
remains  the  same,  but  the  enormous  protocol  overhead  is  drastically  reduced. 

[Nissen.S 73] Nissen, S.M. and Wallach, S.J.

The  All  Applications  Digital  Computer. 

In  Proc.  of  Symp.  on  High-Level-Language  Computer  Architecture,  pages  43-51. 
IEEE  and  ACM,  November,  1973. 

AADC  is  a  modular  computer  system  in  which  the  Data  Processing  Elements  are 
specially  designed  to  support  the  APL  programming  language.  Many  of  the  APL 
operators  are  directly  supported  in  hardware.  AADC  also  has  extended 
hardware  support  for  virtual  memory  management.  Fifteen  different  page 
replacement  algorithms  are  directly  supported  in  hardware,  and  the  choice  of 
which  algorithm  to  use  is  under  program  control. 

[Organick.E  72]  Organick,  E.I. 

The  Multics  System:  An  Examination  of  Its  Structure. 

MIT  Press,  1972. 

The  structure  of  the  Multics  operating  system  is  described  in  considerable  detail. 

[Organick.E  73]  Organick,  E.I. 

Computer  System  Organization  -  The  B5700/B6700  Series. 

Academic  Press,  1973. 

The B5700/B6700 Series, like the B1700, provides microcode support for high-
level  languages  and  some  operating  system  functions.  There  are  hardware  / 
microcode  facilities  for  handling  tasking,  communication,  and  synchronization. 

[Ousterho.J  80]  Ousterhout,  J.K.,  Scelza,  D.A.,  and  Sindhu,  P.S. 

Medusa:  An  Experiment  in  Distributed  Operating  System  Structure. 

Communications  of  the  ACM  23(2):92-105,  February,  1980. 

In Medusa, microcode in the Kmap processors of Cm* provides the interprocess
communication mechanism by supporting operations upon message pipes.
Semaphores,  indivisible  increment,  remote  memory  access,  object  descriptor 
manipulation,  and  fast  block  memory  transfers  are  other  facilities  provided  by 
microcode. 


[Patterso.D 80] Patterson, D.A. and Ditzel, D.R.

The Case for the Reduced Instruction Set Computer.

Computer Architecture News 8(6):25-33, October, 1980.

Patterson and Ditzel look at the reasons behind the current preponderance of
complex instruction set machines and find that the reasons are generally not
very convincing. They also point out that CISCs have generally had a number of
problems associated with them, such as increased design time and design
errors. They suggest that a reduced instruction set approach would better serve
the goal of supporting a high-level language computer system.

[Patterso.D 81] Patterson, D.A. and Sequin, C.H.

RISC I: A Reduced Instruction Set VLSI Computer.

In Proc. of 8th Annual Symp. on Computer Architecture, pages 448-457. IEEE and
ACM, May, 1981.

RISC I has a simple instruction set so that almost all instructions execute in one
machine cycle, essentially at microengine speed. A multiple overlapping
register set scheme is used to allow very fast procedure call and return.
However, the many registers would make process switching very slow.

[Pollack.F 82] Pollack, F.J., Cox, G.W., Hammerstrom, D.W., Kahn, K.C., Lai, K.K., and Rattner, J.R.

Supporting Ada Memory Management in the iAPX-432.

In Proc. of Symp. on Architectural Support for Programming Languages and
Operating Systems, pages 117-131. ACM, March, 1982.

The Intel iAPX 432 capability based object addressing scheme is described in
considerable detail. Stack, global heap, and local heap allocation of objects are
all provided in hardware.

[Popek.G 75] Popek, G.J. and Kline, C.S.

The PDP-11 Virtual Machine Architecture: A Case Study.

In  Proc.  of  5th  Symp.  on  Operating  Systems  Principles,  pages  97-105.  ACM, 
November,  1975. 

Popek  and  Kline  discuss  the  architectural  changes  needed  to  support  a  virtual 
machine  system  on  a  PDP-11/45.  Ten  sensitive  instructions  were  modified  to 
trap  when  executed  in  non-privileged  mode  and  a  performance  enhancement 
unit  was  added  to  interpret  most  of  a  virtual  machine’s  references  to  its  upper 
4K  of  memory  (its  I/O  and  status  registers). 

Radin,  G. 

The  801  Minicomputer. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  39-47.  ACM,  March,  1982. 

The 801 is a reduced instruction set machine. Radin emphasizes that the goal of
such machines is to provide a simple, powerful instruction set that can be
executed at about the speed of microcode. In this way, the operating system, as
well as all applications, is essentially implemented in "microcode". A very
powerful compiler technology is an essential part of a reduced instruction set
computer, allowing all programming to be done in a high level language.
[Rashid.R 81] Rashid, R.F. and Robertson, G.G.

Accent: A Communication Oriented Network Operating System Kernel.

In Proc. of 8th Symp. on Operating Systems Principles, pages 64-75. ACM,
December, 1981.

In Accent, process management, interprocess communication, and virtual memory
management are all supported in microcode. Process switching is aided by
including the process ID as part of the virtual address so that no address
mapping registers need be switched.

[Rattner.J 80] Rattner, J. and Cox, G.

Object-Based Computer Architecture.

Computer Architecture News 8(6):4-11, October, 1980.

Rattner and Cox discuss the object based computer architecture of the Intel iAPX
432. The function migration to hardware was done selectively to the best
advantage from the standpoint of speed, space, and flexibility. Care was taken
to keep resource management policies in software and put resource
management mechanisms in hardware.

[Reghbati.H 78] Reghbati, H.K. and Hamacher, V.C.

Hardware Support for Concurrent Programming in Loosely Coupled
Multiprocessors.

In Proc. of 5th Annual Symp. on Computer Architecture, pages 195-201. IEEE and
ACM, April, 1978.

This paper expands on the work of Ford and Hamacher concerning hardware
support for processes and single word mailboxes. It extends the idea to handle
scheduled waits in monitors. A centralized process status table, rather than
one per processor as before, is needed to support this type of global scheduling
in a loosely coupled system.

[Richards.H 75] Richards, H., Jr. and Oldehoeft, A.E.

Hardware-Software Interactions in SYMBOL-2R's Operating System.

In Proc. of 2nd Annual Symp. on Computer Architecture, pages 113-118. IEEE and
ACM, January, 1975.

SYMBOL supports 32 virtual processors, one per user with one reserved for
operating system software. The hardware component of the OS, the System
Supervisor, is a dedicated processor responsible for scheduling and paging. It
invokes the OS software for other functions. The hardware scheduling
algorithms permit software setting of various parameters, and the hardware
page replacement algorithm takes into account processing mode, queue
position, and type of data in order to make the best choice.

[Ritchie.D 74] Ritchie, D.M. and Thompson, K.

The  UNIX  Time-Sharing  System. 

Communications  of  the  ACM  17(7):365-375,  July,  1974. 

This  is  the  original  paper  describing  the  philosophy,  design,  and  features  of  UNIX. 
[Rosen.S 68] Rosen, S.

Hardware  Design  Reflecting  Software  Requirements. 

In Proc. of Fall Joint Computer Conf., pages 1443-1449. AFIPS, December, 1968.

Rosen briefly surveys the state of hardware support for the user visible "extended
machine", as it stood in 1968. He suggests that interrupt handling, dynamic
storage  allocation,  job  management,  compilation,  and  debugging  aids  are 
important  areas  in  which  hardware  support  could  be  beneficial. 

[Rowan.J  75]  Rowan,  J.H.,  Smith,  D.A.,  and  Swensen,  M.D. 

Toward  the  Design  of  a  Network  Manager  for  a  Distributed  Computer  Network. 

In Feng, T., editor, Parallel Processing: Proc. of Sagamore Computer Conf., August
20-23,  1974,  pages  148-166.  Springer-Verlag,  1975. 

The  authors  describe  a  system  in  which  a  special  purpose  processor,  called  the 
Network Manager, interfaces to a number of functional nodes over one or more
shared  buses.  The  Network  Manager  provides  a  simple  interprocessor  message 
communication  facility,  and  basic  priority  and  deadline  scheduling  services. 

[Ruggiero.M  80]  Ruggiero,  M.D.  and  Zaky,  S.G. 

A Microprocessor-Based Virtual Memory System.

In  Proc.  of  7th  Annual  Symp.  on  Computer  Architecture,  pages  228-235.  IEEE  and 
ACM,  May,  1980. 

Ruggiero  and  Zaky  describe  a  microprocessor  system  in  which  the  virtual  memory 
management  is  handled  entirely  by  a  separate  microprocessor.  In  the  current 
implementation  the  host  processor  is  expected  to  simply  wait  while  the  page 
fault is handled. However, there are a number of potential advantages of
concurrent  execution  between  the  host  and  virtual  memory  processors.  Paging 
out  can  be  done  ahead  of  time  and  more  elaborate  algorithms  can  be  used  at 
no  extra  cost. 

[Rushby.J  83]  Rushby,  J.  and  Randell,  B. 

A  Distributed  Secure  System. 

IEEE  Computer  16(7):55-67,  July,  1983. 

Rushby  and  Randell  describe  a  proposed  secure  distributed  system  in  which 
standard,  untrustworthy  Unix  systems  are  connected  to  a  shared  local  area 
network  through  special  hardware  units  called  Trustworthy  Network  Interface 
Units  (TNIUs).  Individual  Unix  systems  are  assigned  to  separate  security 
classes,  and  the  TNIUs  primarily  use  encryption  techniques  to  enforce 
multilevel  security  rules  on  the  transmission  of  information  between  the 
systems. 


[Saltzer.J 81] Saltzer, J.H., Reed, D.P., and Clark, D.D.

End-to-End  Arguments  in  System  Design. 

In  Proc.  of  2nd  Int.  Conf.  on  Distributed  Computing  Systems,  pages  509-512.  IEEE, 
April,  1981. 

The authors argue that system designers must think very carefully about the
placement of functions in a layered system. Certain functions usually placed at
low  levels  of  the  system  are  often  redundant  or  of  little  value.  In  the  context  of 
message  communication  systems,  the  end-to-end  argument  basically  states 
that  if  a  function  cannot  be  handled  without  the  specialized  knowledge  and  help 
of  the  application  standing  at  both  ends  of  the  communication  system,  then  the 
lower  levels  should  not  strain  very  hard  to  provide  the  function.  At  best  they 
can  enhance  performance  somewhat,  but  the  application  will  still  have  to 
handle  the  function  itself. 

[Schroede.M  72]  Schroeder,  M.D.  and  Saltzer,  J.H. 

A  Hardware  Architecture  for  Implementing  Protection  Rings. 

Communications of the ACM 15(3):157-170, March, 1972.

Rings  of  protection  in  Multics  were  originally  supported  by  software.  This  paper 
suggests  a  hardware  implementation  of  this  mechanism  so  that  most  cross-ring 
CALL/RETURN  operations  take  the  same  amount  of  time  as  regular 
CALL/RETURN. 

[Schroede.S  73]  Schroeder,  S.C.  and  Vaughn,  L.E. 

A  High  Order  Language  Optimal  Execution  Processor:  Fast  Intent  Recognition 
System  (FIRST). 

In Proc. of Symp. on High-Level-Language Computer Architecture, pages 109-116.
IEEE  and  ACM,  November,  1973. 

In  FIRST,  Satellite  Processing  Units  handle  I/O  and  preliminary  scheduling.  The 
master  processing  unit  consists  of  multiple  machines  for  compiling  and 
executing  APL  programs,  and  a  separate  processor  for  handling  operating 
system  functions  such  as  job  scheduling,  resource  allocation,  library 
maintenance,  diagnostics,  and  system  error  procedures.  Communication 
among  processors  is  through  shared  memory. 

[SDS  68]  Scientific  Data  Systems. 

SDS  Sigma  7  Computer  Reference  Manual 

1968. 

The  Sigma  7  has  32  register  sets  where  each  register  set  can  hold  the  state  of  a 
different  process.  As  a  result,  fast  process  switching  is  possible  by  simply 
changing  the  active  register  set. 

[Singer  73]  Singer  Business  Machines. 

System  [Ten]  Summary  Manual 

1973. 

The Singer System Ten has a simple round-robin time-slicing supervisor
implemented in hardware. Memory partition sizes are fixed at installation time
and can be changed later.
[Sites.R 79] Sites, R.L.

How to Use 1000 Registers.

In Proc. of CalTech Conf. on VLSI, pages 527-532. CalTech Computer Science
Dept., January, 1979.

Sites introduces the idea of using cached multiple register sets with a
"dribble-back" saving technique and a prefetch restoring technique. Such a design will
improve the speed of procedure call and return, but the many registers make
process switching very slow. Sites suggests multiple such caches to improve
process switching time.

[Smith.D 79] Smith, D.C.P. and Smith, J.M.

Relational Data Base Machines.

IEEE Computer 12(3):28-38, March, 1979.

Smith and Smith survey a variety of hardware support techniques for databases
organized according to the relational model of data.

[Smith.W 71] Smith, W.R., Rice, R., Chesley, G.D., Laliotis, T.A., Lundstrom, S.F., Calhoun, M.A.,
Gerould, L.D., and Cook, T.G.

SYMBOL: A Large Experimental System Exploring Major Hardware Replacement of
Software.

In Proc. of Spring Joint Computer Conf., pages 601-616. AFIPS, May, 1971.

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill, 1982, pp. 489-502.

[Sockut.G 75] Sockut, G.H.

Firmware / Hardware Support for Operating Systems: Principles and Selected
History.

Technical Report TR 22-75, Harvard University, Center for Research in Computing
Technology, October, 1975.

Sockut lists five proposed criteria for determining which operating system functions
are the best candidates for hardware implementation. A selected history of the
area is then presented, with very brief descriptions of a number of interesting
systems and research efforts.

[Solomon.M 79] Solomon, M.H. and Finkel, R.A.

The Roscoe Distributed Operating System.

In Proc. of 7th Symp. on Operating Systems Principles, pages 108-114. ACM,
December, 1979.

[Spector.A 82] Spector, A.Z.

Performing  Remote  Operations  Efficiently  on  a  Local  Computer  Network. 

Communications  of  the  ACM  25(4):246-260,  April,  1982. 

Spector  describes  a  remote  reference  /  remote  operation  communication  model 
that  can  serve  as  the  basis  for  a  highly  efficient  communication  subsystem.  He 
stresses  that  efficient  implementations  may  require  the  use  of  microcode  or 
specialized  hardware. 
[Steel.R 77] Steel, R.

Another  General  Purpose  Computer  Architecture. 

Computer  Architecture  News  5(S):S- 1 1 ,  April,  1977. 

Steel  describes  an  architecture  in  which  a  separate  System  Management 
Processor  handles  all  process  scheduling,  management,  and  interprocess 
communication.  One  or  more  general  purpose  processors  are  permitted,  and  all 
I/O  is  handled  by  processes  running  on  one  or  more  I/O  processors. 

Procedure  activation  record  allocation  and  deallocation  using  overlapped 
records  is  handled  automatically. 

[Stockenb.J 73] Stockenberg, J.E., Anagnostopoulos, P.C., Johnson, R.E., Munck, R.G., Stabler,
G.M., and Van Dam, A.

Operating  System  Design  Considerations  for  Microprogrammed  Mini-Computer 
Satellite  Systems. 

In  Proc.  of  National  Computer  Conf.,  pages  555-562.  AFIPS,  June,  1973. 

This  paper  describes  the  structure  of  the  operating  system  for  the  Brown  University 
Graphics  System  (BUGS).  There  are  3  levels  to  the  system  with  Level  0,  the 
"hardware", simulated by a combination of microcode and software. Level 0
allows  experiments  with  hardware  versus  software  tradeoffs.  It  provides  storage 
management,  extended  I/O,  priority  dispatcher,  WAIT/POST  facility,  and 
extended  interrupt  generation  and  task  creation. 

[Stockenb.J  78]  Stockenberg,  J.E.  and  Van  Dam,  A. 

Vertical  Migration  for  Performance  Enhancement  in  Layered  Hardware  /  Firmware 
/  Software  Systems. 

IEEE Computer 11(5):35-50, May, 1978.

General performance improvements can be achieved by avoiding the prologue and
epilogue overheads associated with functions at higher levels. The authors
describe a semi-automated methodology for determining which functions can
most profitably be migrated downward. Although individual functions can be
sped up by a factor of 10 by migrating them to microcode, factors of 2
improvement are possible for applications which use the functions fairly
heavily.

[Stonebra.M 81] Stonebraker, M.

Operating  System  Support  for  Database  Management. 

Communications  of  the  ACM  24(7):412-418,  July,  1981. 

Stonebraker  discusses  various  operating  system  functions  and  their  usefulness  in 
supporting  database  management  systems  (DBMS).  Often  these  operating 
system  facilities  are  found  to  be  inadequate  and  must  be  provided  anew  by  the 
DBMS.  This  situation  reminds  one  somewhat  of  "the  end-to-end  argument"  of 
Saltzer,  et  al. 

[Strecker.W  78]  Strecker,  W.D. 

VAX-11/780: A Virtual Address Extension to the DEC PDP-11 Family.

In  Proc.  of  National  Computer  Conf.,  pages  967-980.  AFIPS,  June,  1978. 

Reprinted in Siewiorek et al., Computer Structures: Principles and Examples,
McGraw-Hill,  1982,  pp.  716-729. 


[Stritter.E  79]  Stritter,  E.  and  Gunter,  T. 

A  Microprocessor  Architecture  for  a  Changing  World:  The  Motorola  68000. 

IEEE  Computer  12(2):43-52,  February,  1979. 

The 68000 has a relatively complex, orthogonal instruction set with many
addressing modes, functions to aid procedure entry and exit, and functions to
save and restore multiple registers.

[Su.S 79] Su, S.Y.W.

Cellular-Logic  Devices:  Concepts  and  Applications. 

IEEE  Computer  12(3):11-25,  March,  1979. 

Su surveys the use of cellular-logic devices to support database systems. In such
devices  a  processing  element  is  used  for  each  circular  memory  element,  and 
the  concurrent  processing  of  these  elements  allows  fast  data  search  and 
manipulation. 

[Swan.R  77]  Swan,  R.J.,  Fuller,  S.H.,  and  Siewiorek,  D.P. 

Cm*:  A  Modular,  Multi-Microprocessor. 

In  Proc.  of  National  Computer  Conf.,  pages  637-644.  AFIPS,  1977. 

The  microprogrammable  Kmap  processors  in  Cm*  are  intended  to  provide  remote 
memory  access  for  the  various  computer  modules.  However,  they  can  also  be 
used  to  implement  many  operating  system  functions,  especially  the 
interprocess  communication  facility. 

[Thacker.C  82]  Thacker,  C.P.,  McCreight,  E.M.,  Lampson,  B.W.,  Sproull,  R.F.,  and  Boggs,  D.R. 

Alto:  A  Personal  Computer. 

In  Siewiorek,  D.P.,  Bell,  C.G.,  and  Newell,  A.,  editor,  Computer  Structures: 
Principles  and  Examples,  pages  549-572.  McGraw-Hill,  1982. 

The  Alto  supports  16  tasks,  each  with  a  different  priority  level.  Task  switching  to  the 
highest  priority  ready  task  is  performed  semiautomatically  in  response  to  the 
TASK  command.  The  various  device  controllers  are  quite  intelligent,  having  the 
full  power  of  the  main  micromachine  available  to  them. 

[Thornton.J 64] Thornton, J.E.

Parallel  Operation  in  the  Control  Data  6600. 

In  Proc.  of  Fall  Joint  Computer  Conf.,  Pt.  2,  pages  33-40.  AFIPS,  1964. 

Reprinted  in  Siewiorek  et  al.,  Computer  Structures:  Principles  and  Examples, 
McGraw-Hill,  1982,  pp.  730-736. 

[Thurber.K 81] Thurber, K.J.

Hardware  Issues. 

In Lampson, B.W., Paul, M., and Siegert, H.J., editor, Distributed Systems
- Architecture and Implementation, pages 377-412. Springer-Verlag, 1981.

Thurber  briefly  discusses  the  design  of  an  architecture  containing  a  system  control 
unit  which  is  separate  from  the  application  processor  and  provides  the 
operating  system  kernel.  The  SCU  provides  process  management,  including 
synchronization  (semaphores)  and  communication,  and  handles  all  I/O.  It 
includes  special  state  switch  hardware  to  facilitate  rapid  application  process 
switching. 


[Tokoro.M  80]  Tokoro,  M.,  Tamaru,  K.,  Mizuno,  M.,  and  Hori,  M. 

A  High  Level  Multi-Lingual  Multiprocessor  KMP/II. 

In  Proc.  of  7th  Annual  Symp.  on  Computer  Architecture,  pages  325-333.  IEEE  and 
ACM,  May,  1980. 

In  KMP/II,  an  operating  system  processor  and  an  I/O  processor  are  statically 
assigned.  The  I/O  processor  schedules  its  own  I/O  processes,  which  include 
the  file  system  processes.  The  OS  processor  allocates  language  emulation 
microcode  among  the  application  processors,  provides  interprocess 
communication  facilities  and  scheduling  of  system  and  user  processes,  and 
handles  all  supervisor  calls. 

[Traiger.I 82] Traiger, I.L.

Virtual  Memory  Management  for  Database  Systems. 

Operating  Systems  Review  16(4):26-48,  October,  1982. 

Traiger  discusses  DBMS  buffer  management  by  describing  two  schemes,  shadow 
paging  and  write  ahead  log,  which  can  be  used  to  ensure  proper  recovery  from 
crashes.  He  then  discusses  the  extensions  necessary  for  a  generalized  virtual 
memory  manager  to  be  able  to  handle  (most  of)  the  operations  now  handled  by 
the  DBMS  buffer  manager.  This  would  permit  the  mapping  of  files  into  virtual 
memory  while  maintaining  recovery  capabilities. 

[Treleave.P  82]  Treleaven,  P.C.  and  Lima,  I.G. 

Japan's  Fifth-Generation  Computer  Systems. 

IEEE  Computer  15(8):79-88,  August,  1982. 

Treleaven  and  Lima  provide  a  good  overview  of  the  Japanese  Fifth  Generation 
Computer  Systems  Project.  This  is  a  very  ambitious  project  aimed  at  developing 
"knowledge-information  processing  systems  based  on  innovative  theories  and 
technologies that can offer the advanced functions expected to be required in
the  1990’s,  overcoming  the  technical  limitations  inherent  in  conventional 
computers."  The  combined  software  and  hardware  of  a  fifth  generation 
computer  system  is  to  provide  three  basic  functions:  the  intelligent  interface, 
knowledge-base  management,  and  problem-solving  and  inference  functions. 
There  will  be  substantial  hardware  support  for  each  of  these  functions. 

[VanDeSne.J  79]  Van  de  Snepscheut,  J.L.A.  and  Slavenburg,  G.A. 

Introducing  the  Notion  of  Processes  to  Hardware. 

Computer  Architecture  News  7(7):13-23,  April,  1979. 

By  having  multiple  register  sets  and  a  queue  of  ready  process  IDs  a  very  fast 
hardware  process  switching  mechanism  is  possible,  perhaps  even  a  switch 
every  instruction  cycle.  Hardware  provided  semaphore  operations  can  be  very 
fast  by  doing  parallel  operations  on  process  control  blocks  maintained  in 
special hardware cells. All I/O is assumed to use semaphores, so no separate
interrupt  structure  is  necessary. 

[VonPuttk.E  75]  Von  Puttkamer,  E. 

A  Simple  Hardware  Buddy  System  Memory  Allocator. 

IEEE  Transactions  on  Computers  C-24(10):953-957,  October,  1975. 

Von  Puttkamer  shows  how  the  buddy  system  algorithm  for  memory  allocation  can 
be  implemented  very  efficiently,  and  quite  simply,  using  special  hardware.  Shift 
registers  are  used  to  maintain  the  binary  tree  which  represents  the  current  state 
of  the  memory  system. 


[Wall.C 74] Wall, C.F.

Design  Features  of  the  BCC  500  CPU. 

Technical Report R-1, The Aloha System, Task II, Dept. of Electrical Engineering,
Univ.  of  Hawaii,  January,  1974. 

The  BCC  500  is  a  functionally  partitioned  multiprocessor  designed  to  support  a 
large  number  of  time  sharing  users.  There  are  2  independent  CPUs  for  running 
user  programs,  a  process  scheduling  processor,  a  memory  management 
processor,  and  a  terminal  handling  processor.  This  report  concentrates  on  the 
CPU  architecture. 

[Ward.S  79]  Ward,  S.A. 

TRIX:  A  Network-Oriented  Operating  System. 

Technical  Report,  MIT,  December,  1979. 

[Watson.W  72]  Watson,  W.J. 

The TI ASC - A Highly Modular and Flexible Super Computer Architecture.

In  Proc.  of  Fall  Joint  Computer  Conf.,  pages  221-228.  AFIPS,  December,  1972. 

Reprinted  in  Siewiorek  et  al.,  Computer  Structures:  Principles  and  Examples, 
McGraw-Hill,  1982,  pp.  753-762. 

[Wegner.P  80]  Wegner,  P. 

Programming  with  Ada:  An  Introduction  by  Means  of  Graduated  Examples. 

Prentice-Hall, 1980.

[Wendorf.J  83]  Wendorf,  J.W. 

Hardware  Support  for  Operating  System  Architectures. 

In preparation, Computer Science Department, Carnegie-Mellon University.

[Wendorf.R  82]  Wendorf,  R.G. 

Decentralized  Resource  Management. 

In  Definition  of  Distributed  Operating  System  Concepts  and  Techniques:  NOSC 
Contract N66001-81-C-0484 First Quarterly Progress Report, 20 January 1982.
Carnegie-Mellon University, Computer Science Department, 1982.

[Werkheis.A  70]  Werkheiser,  A.H. 

Microprogrammed  Operating  Systems. 

In Preprints of 3rd Annual Workshop on Microprogramming, pages III.C.1-III.C.16.
IEEE  and  ACM,  October,  1970. 

Werkheiser  divides  operating  system  functions  into  3  levels:  miniprimitives  (data 
structure  manipulation  and  searching,  subroutine  linkage,  etc.);  midiprimitives 
(IPC,  semaphores,  memory  management,  file  allocation,  scheduling,  etc.);  and 
maxiprimitives  (compilers,  loaders,  etc.).  He  suggests  which  primitives  for  each 
level  are  most  appropriate  for  microprogrammed  implementation,  based  on  his 
suggested  criteria  for  making  such  implementation  decisions. 


[Wilkes.M 82] Wilkes, M.V.

Hardware  Support  for  Memory  Protection:  Capability  Implementations. 

In  Proc.  of  Symp.  on  Architectural  Support  for  Programming  Languages  and 
Operating  Systems,  pages  107-116.  ACM,  March,  1982. 

Wilkes  discusses  the  present  state  of  hardware  supported  capability  systems  and 
makes  some  suggestions  on  how  to  reduce  their  complexity.  He  believes  that  it 

Abstract

This  paper  discusses  the  synchronization  issues  that  arise  when  transaction  facilities  are  extended  for  use  with 
shared  abstract  data  types.  A  formalism  for  specifying  the  concurrency  properties  of  such  types  is  developed, 
based on dependency relations that are defined in terms of an abstract type's operations. The formalism
requires that the specification of an abstract type state whether or not cycles involving these relations should
be allowed to form. Directories and two types of queues are specified using the technique, and the degree to
which  concurrency  is  restricted  by  type-specific  properties  is  exemplified.  The  paper  also  discusses  how  the 
specifications  of  types  interact  to  determine  the  behavior  of  transactions.  A  locking  technique  is  described 
that  permits  implementations  to  make  use  of  type-specific  information  to  approach  the  limits  of  concurrency. 

1 Introduction

Transaction facilities, as provided in many database systems, permit the definition of transactions
containing operations that read and write the database and that interact with the external world. The
transaction facility of the database system guarantees that each invocation of a transaction will execute at most
once (i.e., either commit or abort) and will be isolated from the deleterious effects of all concurrently
executing transactions. To make these guarantees, the transaction facility manages transaction
synchronization, recovery, and, if necessary, inter-site coordination. Many papers have been written about
transactions in the context of both distributed and non-distributed databases [Bernstein 81, Eswaran 76, Gray
80, Lampson 81, Lindsay 79].

There  are  a  number  of  ways  in  which  transaction  facilities  could  be  extended  to  simplify  the  construction  of 
many types of reliable distributed programs. Extensions that allow a wider variety of operations to be
included  in  a  transaction  would  facilitate  manipulation  of  shared  objects  other  than  a  database.  Extensions 
that  permit  transaction  nesting  would  facilitate  more  flexible  program  organizations,  as  would  extensions 
allowing  some  form  of  inter-transaction  communication  of  uncommitted  data.  Although  the  synchronization, 
recovery,  and  inter-site  coordination  mechanisms  needed  to  support  database  transaction  facilities  are 
reasonably  well  understood,  these  mechanisms  require  substantial  modification  to  support  such  extensions. 
For  example,  they  must  be  made  compatible  with  the  abstract  data  type  model  and  with  general 
implementation  techniques  such  as  dynamic  storage  allocation. 

Lomet  [Lomet  77]  considered  some  of  the  problems  encountered  in  developing  general-purpose  transaction 
facilities,  but  more  recently,  much  of  the  research  in  this  area  has  been  done  at  MIT.  Moss  and  Reed  have 
discussed nested transactions and other related systems issues [Moss 81, Reed 78]. As part of the Argus
project,  extensions  to  CLU  have  been  proposed  that  incorporate  primitives  for  supporting  transactions 
[Liskov  82a,  Liskov  82b].  Additionally,  Weihl  has  considered  transactions  that  contain  calls  on  shared 
abstract types such as sets and message queues, and has discussed their implementation [Weihl 83a, Weihl
83b]. Transactions will also be available in the Clouds distributed operating system [Allchin 83].

This  paper  focuses  on  one  important  issue  that  arises  when  extending  transaction  facilities:  the 
synchronization  of  operations  on  shared  abstract  data  types  such  as  directories,  stacks,  and  queues.  After  a 
presentation of background material in the following section, Section 3 introduces some tools and notation for
specifying  shared  abstract  types.  Section  4  describes  three  particular  data  types  and  uses  the  tools  to  specify 
how  operations  on  these  types  can  interact  under  conditions  of  concurrent  access  by  multiple  transactions. 
The  specifications  that  are  developed  make  explicit  use  of  type-specific  properties,  and  it  is  shown  how  this 
approach permits greater concurrency than standard techniques that do not use such information. Section
5 discusses how the specifications of individual types interact to determine global properties of groups of
transactions. Section 6 proposes an extensible approach to locking that can be used for synchronization in
implementations intended to meet these specifications. Finally, Section 7 summarizes the major points of this
paper  and  concludes  with  a  brief  discussion  of  other  considerations  in  the  implementation  of  user-defined, 
shared  abstract  data  types. 

2  Background 

Transactions aid in maintaining arbitrary application-dependent consistency constraints on stored data. The
constraints  must  be  maintained  despite  failures  and  without  unnecessarily  restricting  the  concurrent 
processing  of  application  requests. 

In the database literature, transactions are defined as arbitrary collections of database operations bracketed
by two markers: BeginTransaction and EndTransaction. A transaction that completes successfully commits;
an incomplete transaction can terminate unsuccessfully at any time by aborting. Transactions have the
following special properties:

1. Either all or none of a transaction's operations are performed. This property is usually called
failure atomicity.

2. If a transaction completes successfully, the effects of its operations will never subsequently be lost.
This property is usually called permanence.

3. If a transaction aborts, no other transactions will be forced to abort as a consequence. Cascading
aborts are not permitted.

4. If several transactions execute concurrently, they affect the database as if they were executed
serially in some order. This property is usually called serializability.

Transactions  lessen  the  burden  on  application  programmers  by  simplifying  the  treatment  of  failures  and 
concurrency.  Failure  atomicity  makes  certain  that  when  a  transaction  is  interrupted  by  a  failure,  its  partial 
results  are  undone.  Programmers  are  therefore  free  to  violate  consistency  constraints  temporarily  during  the 
execution of a transaction. Serializability ensures that other concurrently executing transactions cannot
observe  these  inconsistencies.  Permanence  and  prevention  of  cascading  aborts  limit  the  amount  of  effort 
required  to  recover  from  a  failure.  Transaction  models  that  do  not  prohibit  cascading  aborts  are  possible,  but 
we  do  not  consider  them. 

Our  model  for  using  transactions  in  distributed  systems  differs  from  this  traditional  model  in  several  ways. 
The  most  important  difference  is  that  we  incorporate  the  concept  of  an  abstract  data  type.  That  is, 
information is stored in typed objects and manipulated only by operations that are specific to a particular
object type. The users of a type are given a specification that describes the effect of each operation on the
stored data, and new abstract types can be implemented using existing ones. The details of how objects are
represented and how the operations are carried out are known only to a type's implementor. Abstract data
types grew out of the class construct in Simula [Dahl 72], and are supported in many other programming
languages including CLU [Liskov 77], Alphard [Wulf 76], and Ada [Dept. of Defense 82], as well as in
operating systems, e.g., Hydra [Wulf 74]. In our system model, transactions are composed of operations on
objects that are instances of abstract types. Of particular interest are those objects that are not local to a single
transaction. These are instances of shared abstract types.

We  assume  that  the  facilities  for  implementing  shared  abstract  types  and  for  coordinating  the  execution  of 
transactions that operate on them are provided by a basic system layer that executes at each node of the
system.  This  transaction  kernel  exports  primitives  for  synchronization,  recovery,  deadlock  management,  and 
inter-site  communication.  In  some  ways,  a  transaction  kernel  is  similar  to  the  RSS  of  System  R  [Gray  81].  A 
transaction  kernel,  however,  is  intended  to  run  on  a  bare  machine  and  must  supply  primitives  useful  for 
implementing  arbitrary  data  types,  whereas  the  RSS  has  the  assistance  of  an  underlying  operating  system  and 
only  provides  specialized  primitives  tailored  for  manipulating  a  database. 

Another  difference  between  our  system  model  and  the  traditional  transaction  model  is  that  we  do  not 
necessarily require that transactions appear to execute serially. Serializability ensures that if transactions work
correctly in the absence of concurrency, any interleaving of their operations that is allowed by the system will
not affect their correctness. But sometimes, serializability is too strong a property, and requiring it restricts
concurrency unnecessarily. For example, it is usually unnecessary for two letters mailed together and
addressed identically to appear in their recipient's mailbox together. However, serializability is violated if the
letters do not arrive contiguously, because there is no longer the appearance that the sender has executed
without interference from other senders. Thus, it may be desirable for some shared abstract types to allow
limited non-serializable execution of transactions. This idea has also been investigated by Garcia-Molina
[Garcia-Molina 83] and Sha et al. [Sha 83].

Serializability guarantees that an ordering can be defined on a group of transactions. If the transactions
share some common objects, serializability requires that these objects be visited in the same order by all the
transactions in the group. In the next section, a more general ordering property of transactions is defined, of
which serializability is a special case. We will show that it is possible to prove that transactions work correctly
in  the  presence  of  concurrency,  even  if  they  do  not  appear  to  execute  serially. 

In  order  to  maintain  the  special  properties  of  transactions  in  our  model,  the  operations  on  shared  abstract 
types that compose them must meet certain requirements. To guarantee the failure atomicity of transactions,
it must be possible to undo any operation upon transaction abort. Therefore, an undo operation must be
provided for each operation on a shared abstract type. Recovery is not the main concern of this paper, and we
will be considering undo operations only as they pertain to synchronization issues. Further discussion of
recovery issues can be found in a related paper [Schwarz 83].

Operations  on  shared  abstract  types  must  also  meet  three  synchronization  requirements: 

1. Operations must be protected from anomalies that could be caused by other concurrently
executing operations on the same object. Freedom from these concurrency anomalies ensures that
an invocation of an operation on a shared object is not affected by other concurrent operation
invocations. This is the same property that monitors provide [Hoare 74].

2. To preclude the possibility of cascading aborts, operations on shared objects must not be able to
observe information that might change if an uncommitted transaction were to abort. This may
necessitate delaying the execution of operations on behalf of some transactions until other
transactions complete, either successfully or unsuccessfully.

3. When a group of transactions invokes operations on shared objects, the operations may only be
interleaved in ways that preserve serializability or some weaker ordering property of the group of
transactions. The synchronization needed to control interleaving cannot be localized to individual
shared objects, but rather requires cooperation among all the objects shared by the transactions.

Traditional methods for synchronizing access to an instance of a shared abstract type are designed solely to
ensure  the  first  goal:  correctness  of  individual  operations  on  an  object.  This  paper  is  concerned  with  the 
second  and  third  goals.  We  examine  the  problem  of  specifying  the  synchronization  needed  to  achieve  them, 
as  well  as  the  support  facilities  that  the  transaction  kernel  must  provide  to  implementors  of  shared  abstract 
types. 

3  Dependencies:  A  Tool  for  Reasoning  About  Concurrent  Transactions 

This section introduces a theory that can be used to reason about the behavior of concurrent transactions. It
allows the standard definition of serializability to be recast in terms of shared abstract types, and provides a
convenient  way  of  expressing  other  ordering  properties.  The  theory  is  also  useful  in  understanding  cascading 
aborts. 

3.1  Schedules 

Schedules  [Eswaran  76,  Gray  75]  can  be  used  to  model  the  behavior  of  a  group  of  concurrent  transactions. 
Informally, a schedule is a sequence of <transaction, operation> pairs that represents the order in which the
component operations of concurrent transactions are interleaved. Schedules are also known as
histories [Papadimitriou 77] and logs [Bernstein 79]. In some of the traditional database literature, the
operations in schedules are assumed to be arbitrary; no semantic knowledge about them is available [Eswaran
76]. In this case, a schedule is merely an ordered list of transactions and the objects they touch.

In other work, operations are characterized as Read (R) or Write (W) [Gray 75], in which case the schedule
includes that semantic information.

To analyze transactions that contain operations on specific shared abstract types, we will consider schedules
in which these operations are characterized explicitly. For example, a schedule may contain operations to
enter an element on a queue or to insert an entry into a directory. We call these abstract schedules, because
they describe the order in which operations affect objects, regardless of any reordering that might be done by
their implementation.¹ Given the initial state of a set of objects, an abstract schedule of operations on these
objects, and specifications for the operations in the schedule, the result of each operation and the final state of
the objects can be deduced. For instance, consider the following abstract schedule, which is composed of
operations on Q, a shared object of type FIFO Queue. The operations QEnter and QRemove respectively
append an element to the tail of a FIFO Queue and remove one from its head. Assume Q to be empty
initially.

$T_1$: QEnter(Q, X)

$T_2$: QEnter(Q, Y)

$T_3$: QRemove(Q)

From  this  abstract  schedule  and  the  initial  contents  of  the  Queue,  one  can  deduce  the  state  of  Q  at  any  point 
in  the  schedule.  Thus  one  may  conclude  that  the  QRemove  operation  returns  X,  and  that  only  Y  remains  on 
the  Queue  at  the  end  of  the  schedule. 
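
The deduction described above can be replayed mechanically. The following is a minimal sketch, not part of the original system: the function name `replay` and the tuple encoding of schedule steps are assumptions made for illustration, and the first transaction is taken to be T1.

```python
from collections import deque

def replay(schedule):
    """Apply an abstract schedule of (transaction, operation, argument) steps
    to an initially empty FIFO queue. Returns the per-step results and the
    final queue contents. (Hypothetical helper, for illustration only.)"""
    queue = deque()
    results = []
    for txn, op, arg in schedule:
        if op == "QEnter":
            queue.append(arg)                            # append to the tail
            results.append((txn, op, None))
        elif op == "QRemove":
            results.append((txn, op, queue.popleft()))   # remove from the head
        else:
            raise ValueError(f"unknown operation: {op}")
    return results, list(queue)

schedule = [
    ("T1", "QEnter", "X"),
    ("T2", "QEnter", "Y"),
    ("T3", "QRemove", None),
]
results, final_state = replay(schedule)
print(results[-1])   # ('T3', 'QRemove', 'X')
print(final_state)   # ['Y']
```

The replay needs only the initial state, the schedule, and the operation specifications, which is exactly the information the text says is sufficient.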

3.2  Dependencies  and  Consistency 

By  examining  an  abstract  schedule,  it  is  possible  to  determine  what  dependencies  exist  among  the 
transactions in the schedule. The notation $D: T_i{:}X \rightarrow_O T_j{:}Y$ will be used to represent the dependency $D$
formed when transaction $T_i$ performs operation $X$ and transaction $T_j$ subsequently performs operation $Y$ on
some common object $O$. The object, transaction, or dependency identifiers may be omitted when they are
unimportant. The set of ordered pairs $\{(T_i, T_j)\}$ for which there exist $X$, $Y$ and $O$ such that $D: T_i{:}X \rightarrow_O T_j{:}Y$
forms a relation, denoted $<_D$. If $T_i <_D T_j$, then $T_i$ precedes $T_j$, and $T_j$ depends on $T_i$, under the dependency $D$.


¹ In Section 4.4 we will define a second kind of schedule, the invocation schedule, which reflects the concurrency of specific
implementations.
Examples of dependencies and their corresponding relations can be drawn from traditional database
systems. For instance, consider a system in which no semantic knowledge, either about entire transactions or
about their component operations, is available to the concurrency control mechanism. The only requirement
is that each individual transaction be correct in itself: it must transform a consistent initial state of the
database to a consistent final state. Under these conditions, only serializable abstract schedules can be
guaranteed to preserve the correctness of individual transactions.

Since all operations are indistinguishable, only one possible dependency $D$ can be defined: $T_1 <_D T_2$ if $T_1$
performs any operation on an object later operated on by $T_2$. Now consider $<^*_D$, the transitive closure of $<_D$.
A schedule is orderable with respect to $\{<_D\}$ iff $<^*_D$ is a partial order. In other words, there are no cycles of
the form $T_1 <_D T_2 <_D \cdots <_D T_n <_D T_1$. In general, a schedule is orderable with respect to $S$, where $S$ is a set of
dependency relations, iff each of the relations in $S$ has a transitive closure that is a partial order. The
relations in $S$ are referred to as proscribed relations, and we will use orderability with respect to a set of
proscribed dependency relations to describe ordering properties of groups of transactions. Abstract schedules
that are orderable with respect to a specified set of proscribed relations will be called consistent abstract
schedules.
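
The orderability test can be sketched directly from this definition. The code below is illustrative only: it assumes each proscribed relation is given as a set of (predecessor, successor) transaction pairs, and the names `transitive_closure` and `is_orderable` are hypothetical. A transitive closure is a strict partial order exactly when it relates no transaction to itself, so the check reduces to looking for such pairs.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (Ti, Tj) pairs."""
    closure = set(pairs)
    while True:
        # Compose the relation with itself and stop once nothing new appears.
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

def is_orderable(proscribed_relations):
    """True iff the transitive closure of every proscribed relation is a
    partial order, i.e. no transaction transitively precedes itself."""
    return all(
        all(a != b for a, b in transitive_closure(relation))
        for relation in proscribed_relations
    )

chain = {("T1", "T2"), ("T2", "T3")}
loop = chain | {("T3", "T1")}
print(is_orderable([chain]))  # True
print(is_orderable([loop]))   # False
```

The naive closure computation is adequate for small examples; a production check would more likely use a graph cycle search.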

It can be shown that orderability with respect to $\{<_D\}$ is equivalent to serializability [Eswaran 76]. Given a
schedule orderable with respect to $\{<_D\}$, a transaction $T$, and the set $O$ of objects to which $T$ refers, every
other transaction that refers to an object in $O$ can unambiguously be said either to precede $T$ or to follow
$T$. Thus $T$ depends on a well-defined set of transactions that precede it, and a well-defined set of transactions
depend on $T$. Each transaction sees the consistent database state left by those transactions that precede it, and
(by assumption) leaves a consistent state for those that follow. The set of schedules for which $<^*_D$ is a partial
order constitutes the set of consistent abstract schedules for a system that employs no semantic knowledge.

The  scheme  described  above  prevents  cycles  in  the  most  general  possible  dependency  relation,  hence  it 
maximally  restricts  concurrency.  By  considering  the  semantics  of  operations  on  objects,  it  is  possible  to 
identify  some  dependency  relations  for  which  cycles  may  be  allowed  to  form.  For  example,  consider  a 
database with a Read/Write concurrency control. Such systems recognize two types of operations on objects:
Read (R) and Write (W). Thus there are four possible dependencies between a pair of transactions that access a
common  object: 

• $D_1$: $T_i{:}R \rightarrow_O T_j{:}R$. $T_i$ reads an object subsequently read by $T_j$.

• $D_2$: $T_i{:}R \rightarrow_O T_j{:}W$. $T_i$ reads an object subsequently modified by $T_j$.

• $D_3$: $T_i{:}W \rightarrow_O T_j{:}R$. $T_i$ modifies an object subsequently read by $T_j$.

• $D_4$: $T_i{:}W \rightarrow_O T_j{:}W$. $T_i$ modifies an object subsequently modified by $T_j$.


The earlier scheme, by not distinguishing between these dependencies, prevents cycles from forming in the
dependency relation $<_D$, which is the union of all four individual relations. By contrast, Read/Write
concurrency controls take into account the fact that $R \rightarrow R$ dependencies cannot influence system behavior.
That is, given a pair of transactions, $T_1$ and $T_2$, and an abstract schedule in which both $T_1$ and $T_2$ perform a
Read on a shared object, the semantics of Read operations ensure that neither $T_1$, $T_2$ nor any other
transaction in the schedule can determine whether $T_1 <_{D_1} T_2$ or $T_2 <_{D_1} T_1$. Since these dependencies cannot
be observed, they cannot compromise serializability, nor can they affect the outcome of transactions. We call
dependencies meeting this criterion insignificant. Korth has also noted that when operations are
commutative, their ordering does not affect serializability [Korth 83].


For the Read/Write case, the necessary condition for serializability can be restated as follows in terms of
dependency relations: a schedule is serializable if it is orderable with respect to $\{<_{D_2 \cup D_3 \cup D_4}\}$ [Gray 75]. By
allowing multiple readers, Read/Write schemes permit the formation of cycles in the $<_{D_1}$ dependency
relation, and in relations that include $<_{D_1}$, while preventing cycles in the relation that is the union of $<_{D_2}$, $<_{D_3}$
and $<_{D_4}$. For example, consider the following schedules, which have identical effects on the system state:


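First schedule:

$T_1$: R($O_1$)
$T_2$: R($O_1$)
$T_1$: W($O_1$)

Second schedule:

$T_2$: R($O_1$)
$T_1$: R($O_1$)
$T_1$: W($O_1$)

In the first schedule, $T_1 <_{D_1} T_2$ and $T_2 <_{D_2} T_1$. Hence, there is a cycle in the relation $<_{D_1 \cup D_2}$, although
$<_{D_2 \cup D_3 \cup D_4}$ is cycle-free. In the second schedule, the first two steps are reversed and neither relation contains a cycle.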

On the other hand, the following two schedules are not necessarily identical in effect:

First schedule:

$T_1$: R($O_1$)
$T_2$: W($O_1$)
$T_1$: W($O_1$)

Second schedule:

$T_2$: W($O_1$)
$T_1$: R($O_1$)
$T_1$: W($O_1$)

In this case, the first schedule is not serializable because $T_1 <_{D_2} T_2$ and $T_2 <_{D_4} T_1$, thus forming a cycle in the
relation $<_{D_2 \cup D_4}$, which is a sub-relation of $<_{D_2 \cup D_3 \cup D_4}$. $T_1$ observes $O_1$ before it is written by $T_2$, but the final
state of $O_1$ reflects the Write of $T_1$ rather than $T_2$, implying that $T_1$ ran after $T_2$. The second schedule has no
cycle and is serializable.
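
The Read/Write test can be sketched in code. This is an illustration under stated assumptions, not the authors' mechanism: schedules are encoded as lists of (transaction, operation, object) triples, the function names are hypothetical, and the sample schedules are one reading of the three-step examples discussed above (all operations on a single object O1).

```python
def rw_dependencies(schedule):
    """Collect the four Read/Write dependency relations from a schedule of
    (txn, op, obj) steps, where op is "R" or "W"."""
    deps = {"R->R": set(), "R->W": set(), "W->R": set(), "W->W": set()}
    for i, (ti, op_i, obj_i) in enumerate(schedule):
        for tj, op_j, obj_j in schedule[i + 1:]:
            if ti != tj and obj_i == obj_j:
                deps[f"{op_i}->{op_j}"].add((ti, tj))
    return deps

def has_cycle(pairs):
    """Depth-first search for a cycle in the graph given by (a, b) edges."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True           # back edge: a cycle exists
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)

def is_serializable(schedule):
    """Orderable with respect to the union of R->W, W->R and W->W."""
    deps = rw_dependencies(schedule)
    return not has_cycle(deps["R->W"] | deps["W->R"] | deps["W->W"])

two_readers = [("T1", "R", "O1"), ("T2", "R", "O1"), ("T1", "W", "O1")]
lost_order = [("T1", "R", "O1"), ("T2", "W", "O1"), ("T1", "W", "O1")]
reordered = [("T2", "W", "O1"), ("T1", "R", "O1"), ("T1", "W", "O1")]

print(is_serializable(two_readers))  # True
print(is_serializable(lost_order))   # False
print(is_serializable(reordered))    # True
```

Note that `two_readers` does contain a cycle once R->R dependencies are included, which is why ignoring insignificant dependencies admits more schedules.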


In summary, orderability with respect to a set of proscribed dependency relations provides a precise way to
characterize consistent schedules. For a concurrency control that enforces serializability with no semantic
knowledge at all about operations, the set of proscribed relations must contain $<_D$, which is equivalent to the
union of every possible dependency relation. For a Read/Write database scheme, the set contains the
$<_{R \rightarrow W \,\cup\, W \rightarrow R \,\cup\, W \rightarrow W}$ relation. When type-specific semantics are considered, type-specific dependency
relations can be defined for each type. In Section 4, dependencies are used to define interleaving
specifications for various abstract types. These specifications provide the information needed to determine
how an individual type can contribute toward maintaining a global ordering property such as serializability.
If a specification guarantees orderability with respect to the union of all significant dependency relations for a
given type, then it is strong enough to permit serializability. In general, however, more concurrency can be
obtained when only weaker ordering properties are guaranteed. The way in which the interleaving
specifications of multiple types interact to preserve global ordering properties is discussed in Section 5.

3.3  Dependencies  and  Cascading  Aborts 

Dependencies are also useful in understanding cascading aborts. A cascading abort is possible when a
dependency forms between two transactions, the first of which is uncommitted. An abort by this
uncommitted transaction may cascade to those that depend on it. Whether or not a cascade actually must
occur depends on the exact type of dependency involved, and the properties of the object being acted upon.
For example, consider the four general dependency relations that arise in Read/Write database systems.
$R \rightarrow R$ dependencies are insignificant, and can never cause cascading aborts. This is analogous to the role of
these dependencies in determining orderability. Likewise, $R \rightarrow W$ and $W \rightarrow W$ dependencies need not cause
cascading aborts, because in both cases the outcome of the second transaction does not depend on data
modified by the first.² By contrast, $W \rightarrow R$ dependencies represent a transfer of information between the two
transactions. In the absence of any additional semantic information, it must be assumed that an abort of the
first transaction will affect the outcome of the second, which must therefore also be aborted.

Once  the  dependencies  that  could  lead  to  cascading  aborts  have  been  identified,  their  formation  must  be 
controlled. Stated in terms of abstract schedules: starting from the first of the two operations that form the
dependency, there must be no overlapping of the two transactions in the schedule, with the prior transaction
in  the  dependency  relation  completing  first.  Such  schedules  will  be  called  cascade-free.  Note  that  some 
consistent  schedules  may  not  be  cascade-free,  and  vice-versa. 
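
For the Read/Write case, one simplified reading of this condition is that no transaction may read an object while another transaction's write to it is still uncompleted. The sketch below checks only that reading; the "End" marker (standing for either commit or abort), the triple encoding, and the function name are assumptions introduced for illustration.

```python
def is_cascade_free(schedule):
    """Check a schedule of (txn, op, obj) steps, with op in {"R", "W", "End"},
    for W -> R dependencies formed on an uncompleted writer.
    "End" marks completion (commit or abort) and carries obj=None."""
    pending_writers = {}  # obj -> set of transactions with uncompleted writes
    for txn, op, obj in schedule:
        if op == "End":
            for writers in pending_writers.values():
                writers.discard(txn)
        elif op == "W":
            pending_writers.setdefault(obj, set()).add(txn)
        elif op == "R":
            # Reading data written by another, still active transaction
            # would expose this reader to a cascading abort.
            if pending_writers.get(obj, set()) - {txn}:
                return False
        else:
            raise ValueError(f"unknown operation: {op}")
    return True

risky = [("T1", "W", "O1"), ("T2", "R", "O1"), ("T1", "End", None), ("T2", "End", None)]
safe = [("T1", "W", "O1"), ("T1", "End", None), ("T2", "R", "O1"), ("T2", "End", None)]
print(is_cascade_free(risky))  # False
print(is_cascade_free(safe))   # True
```

Both sample schedules are serializable, yet only the second is cascade-free, which illustrates that the two properties are independent.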

4  Specification  of  Shared  Abstract  Types 

This  section  focuses  on  the  typed  operations  that  make  up  transactions  and  discusses  how  to  specify  their 
local  synchronization  properties.  The  traditional  specification  of  an  abstract  type  describes  the  behavior  of 
the  type’s  operations  in  terms  of  preconditions,  postconditions,  and  an  invariant.  This  specification  must  be 
augmented  in  several  ways  to  complete  the  description  of  a  shared  abstract  type  in  our  model.  In  the  first 
place, the undo operation corresponding to each regular operation must be specified in terms of
preconditions, postconditions and the invariant. Specification of the undo operations themselves is not
considered further in this paper. It is important to note, however, that the set of consistent abstract schedules
defined by the interleaving specification for a type also implicitly includes schedules in which undo
operations are inserted at all possible points after an operation has been performed but prior to the end of the
invoking transaction. This reflects the assumption that it must be possible to undo any operation prior to
transaction commitment. As will be shown in Section 4.3, this is especially important for types that do not
attempt to enforce serializability of transactions.

² It may be necessary to control the formation of these dependencies anyway, if an insufficiently flexible recovery strategy is used.

The  specification  of  a  shared  abstract  type  must  also  include  a  description  of  how  operations  on  behalf  of 
multiple  transactions  can  be  interleaved.  This  interleaving  specification  can  be  used  by  application 
programmers  to  describe  their  needs  to  prospective  type  implementors  or  to  evaluate  the  suitability  of  existing 
types  for  their  applications.  The  specification  of  a  shared  abstract  type  must  also  list  those  dependencies  that 
will  be  controlled  to  prevent  cascading  aborts.  This  part  of  the  specification  is  used  mainly  by  the  type's 
implementor. 

When  specifying  how  operations  on  a  shared  object  may  interact,  the  amount  of  concurrency  that  can  be 
permitted  depends  in  part  on  how  much  detailed  knowledge  is  available  concerning  the  semantics  of  the 
operations  [Kung  79].  We  have  shown  how  concurrency  controls  that  distinguish  those  operations  that  only 
observe the state of an object ("Reads") from those that modify it ("Writes") can achieve greater concurrency
than protocols not making this distinction. To increase concurrency further while still providing
serializability, one can take advantage of more semantic knowledge about the operations being
performed [Korth 83]. Section 4.1 illustrates how this is done in specifying Directories, using the concepts
and  notation  of  the  last  section. 

When  enough  concurrency  cannot  be  obtained  even  after  fully  exploiting  the  semantics  of  the  operations  on 
a type, it is necessary to dispense with serializability and substitute orderability with respect to some weaker
set of proscribed dependency relations. Sections 4.2 and 4.3 illustrate this by comparing a serializable Queue
type  with  a  variation  that  preserves  a  weaker  ordering  property. 

Finally,  Section  4.4  discusses  how  implementations  may  reorder  operations  to  obtain  even  more 
concurrency,  and  the  steps  that  type  implementors  must  take  to  demonstrate  the  correctness  of  an 
implementation. 


4.1  Directories 

As a first example, consider a Directory data type that is intended to provide a mapping between text strings
and capabilities for arbitrary objects. The usual operations are provided:

• DirInsert(dir, str, capa): inserts capa into Directory dir with key string str. Returns ok or duplicate
key. The undo operation for DirInsert removes the inserted entry, if the insertion was successful.

• DirDelete(dir, str): deletes the capability stored with key string str from dir. Returns ok or not
found. The undo operation for DirDelete restores the deleted capability, if the deletion was
successful.

• DirLookup(dir, str): searches for a capability in dir with key string str. Returns the capability capa
or not found. The undo operation is null, because DirLookup does not modify the Directory.

• DirDump(dir): returns a vector of <str, capa> pairs with the complete contents of the Directory dir.
The undo operation for DirDump is null.
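
To make the pairing of operations with undo operations concrete, here is a minimal single-threaded sketch. It is not the paper's implementation: the class layout, the method names, and the convention of returning a (result, undo) pair are assumptions chosen for illustration, and no synchronization is shown.

```python
class Directory:
    """Toy Directory mapping key strings to capabilities. Each operation
    returns (result, undo), where undo is a callable that reverses the
    operation's effect; for failed or read-only operations it does nothing."""

    def __init__(self):
        self._entries = {}

    def insert(self, key, capa):
        if key in self._entries:
            return "duplicate key", (lambda: None)
        self._entries[key] = capa
        return "ok", (lambda: self._entries.pop(key, None))

    def delete(self, key):
        if key not in self._entries:
            return "not found", (lambda: None)
        old = self._entries.pop(key)

        def undo():
            self._entries[key] = old   # restore the deleted capability

        return "ok", undo

    def lookup(self, key):
        return self._entries.get(key, "not found"), (lambda: None)

    def dump(self):
        return sorted(self._entries.items()), (lambda: None)

d = Directory()
status, undo_insert = d.insert("Foo", "cap-1")
status_del, undo_delete = d.delete("Foo")
print(d.lookup("Foo")[0])   # not found
undo_delete()
print(d.lookup("Foo")[0])   # cap-1
undo_insert()
print(d.dump()[0])          # []
```

Undo operations are applied in reverse order of the original operations, as they would be when a transaction aborts.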

Suppose  one  wishes  to  specify  the  Directory  type  so  as  to  permit  serialization  of  transactions  that  include 
operations on Directories. One approach would be to model each DirInsert or DirDelete operation as a Read
operation followed by a Write operation, and to model each DirLookup or DirDump operation as a Read
operation. The Directory type could then be specified using the Read/Write dependency relations discussed
previously. 

The  difficulty  with  using  such  limited  semantic  information  is  that  concurrency  is  restricted  unnecessarily. 
For example, suppose Directories have been implemented using a standard two-phase Read/Write locking
mechanism. Consider the operation DirLookup(dir, "Foo"), which will be blocked trying to obtain a Read
lock if another transaction has performed DirDelete(dir, "Fum") and holds a Write lock on the Directory
object. The outcome of DirLookup(dir, "Foo") does not depend in any way on the eventual outcome of
DirDelete(dir, "Fum") (which may later be aborted), or vice-versa, so this blocking is unnecessary. Because
DirDelete(dir, "Fum") may be part of an arbitrarily long transaction, the Write lock may be held for a long
time  and  severely  degrade  performance. 

The  unnecessary  loss  of  concurrency  in  this  example  is  not  the  fault  of  this  particular  implementation.  It  is 
caused  by  the  lack  of  semantic  information  in  the  Directory  specification.  By  using  more  knowledge  about 
the  operations,  this  problem  can  be  alleviated.  Instead  of  expressing  the  interleaving  specification  for  this 
type  in  terms  of  Read  and  Write  operations,  the  type-specific  Directory  operations  can  be  employed  to  define 
dependencies  and  the  interleaving  specifications  can  be  expressed  in  terms  of  these  type-specific 
dependencies. 

To  keep  the  number  of  dependencies  to  a  minimum,  the  operations  for  the  Directory  data  type  will  be 
divided  into  three  groups: 


• Those that modify a particular entry in the Directory. DirInsert and DirDelete operations that
succeed are in this class. These are Modify (M) operations.

• Those that observe the presence, absence, or contents of a particular entry in the Directory.
DirLookup is in this class, as are DirInsert and DirDelete operations that fail. These are Lookup
(L) operations.

• Those that observe properties of the Directory that cannot be isolated to an individual entry.
DirDump is the only operation in this class that we have defined; an operation that returned the
number of entries in the Directory would also be in this class. These are Dump (D) operations.

Note that in some cases operations that fail are distinguished from those that succeed. In addition to the
operations and their outcomes, the dependencies also take into account data supplied to the operations as
arguments or otherwise specific to the particular object acted upon. In the following list of dependencies, the
symbols $\sigma$ and $\sigma'$ represent distinct key string arguments to Directory operations.

The  complete  set  of  dependencies  for  this  type  is: 

• $D_1$: $T_i{:}M(\sigma) \rightarrow T_j{:}M(\sigma')$. $T_i$ modifies an entry with key string $\sigma$, and $T_j$ subsequently modifies an
entry with a different key string, $\sigma'$.

• $D_2$: $T_i{:}M(\sigma) \rightarrow T_j{:}M(\sigma)$. $T_i$ modifies an entry with key string $\sigma$, and $T_j$ subsequently modifies
the same entry.

• $D_3$: $T_i{:}M(\sigma) \rightarrow T_j{:}L(\sigma')$. $T_i$ modifies an entry with key string $\sigma$, and $T_j$ subsequently observes an
entry with a different key string, $\sigma'$.

• $D_4$: $T_i{:}M(\sigma) \rightarrow T_j{:}L(\sigma)$. $T_i$ modifies an entry with key string $\sigma$, and $T_j$ subsequently observes the
same entry.

• $D_5$: $T_i{:}L(\sigma) \rightarrow T_j{:}L(\sigma')$. $T_i$ observes an entry with key string $\sigma$, and $T_j$ subsequently observes an
entry with a different key string, $\sigma'$.

• $D_6$: $T_i{:}L(\sigma) \rightarrow T_j{:}L(\sigma)$. $T_i$ observes an entry with key string $\sigma$, and $T_j$ subsequently observes the
same entry.

• $D_7$: $T_i{:}L(\sigma) \rightarrow T_j{:}M(\sigma')$. $T_i$ observes an entry with key string $\sigma$, and $T_j$ subsequently modifies an
entry with a different key string, $\sigma'$.

• $D_8$: $T_i{:}L(\sigma) \rightarrow T_j{:}M(\sigma)$. $T_i$ observes an entry with key string $\sigma$, and $T_j$ subsequently modifies the
same entry.

• $D_9$: $T_i{:}D \rightarrow T_j{:}M(\sigma)$. $T_i$ dumps the entire contents of the Directory, and $T_j$ subsequently
modifies an entry with key string $\sigma$.

• $D_{10}$: $T_i{:}D \rightarrow T_j{:}L(\sigma)$. $T_i$ dumps the entire contents of the Directory, and $T_j$ subsequently
observes an entry with key string $\sigma$.

• $D_{11}$: $T_i{:}M(\sigma) \rightarrow T_j{:}D$. $T_i$ modifies an entry with key string $\sigma$, and $T_j$ subsequently dumps the
entire contents of the Directory.

• $D_{12}$: $T_i{:}L(\sigma) \rightarrow T_j{:}D$. $T_i$ observes an entry with key string $\sigma$, and $T_j$ subsequently dumps the
entire contents of the Directory.

• $D_{13}$: $T_i{:}D \rightarrow T_j{:}D$. $T_i$ dumps the entire contents of the Directory, and $T_j$ subsequently dumps the
Directory as well.

This  list  is  long,  but  it  is  actually  quite  simple  to  derive.  There  is  a  family  of  dependencies  for  each  pair  of 
operation  classes.  The  key  to  defining  the  specific  dependencies  is  the  observation  that  when  two  operations 
refer  to  different  strings,  the  relationship  between  the  transactions  that  invoked  them  is  not  the  same  as  when 
they  refer  to  identical  strings.  Those  families  of  dependencies  for  which  both  operation  classes  take  a  string 
argument therefore have two members, corresponding to these two cases. The families for which one of the
operation  classes  is  Dump  have  only  a  single  member.  In  general,  insight  into  the  semantics  of  a  type  is 
needed  to  define  the  set  of  possible  dependencies. 

Like the $R \rightarrow R$ dependency, many of the Directory dependencies are insignificant and cannot affect the
outcome of transactions. Hence, they may be excluded from the set of proscribed dependencies for this type.
The dependencies that may be disregarded are:

• Those for which neither operation in the dependency modifies the Directory object: $D_6$, $D_{10}$,
$D_{12}$, and $D_{13}$. These are directly analogous to the $R \rightarrow R$ dependency.

• Those for which the two operations in the dependency refer to different key strings: $D_1$, $D_3$,
$D_5$, and $D_7$.

In terms of the remaining dependencies, the interleaving specification for Directories states that an abstract
schedule involving Directories is consistent if it is orderable with respect to
$\{<_{D_2 \cup D_4 \cup D_8 \cup D_9 \cup D_{11}}\}$. The abstract Directory thus defined behaves like a collection of
associatively-addressed elements, with serializability preservable independently for each element.
Transactions containing operations that apply to the entire Directory, such as DirDump, may also be
serialized, as may those that refer to multiple elements or elements that are not present.

Only two of the Directory dependencies have the potential to cause cascading aborts. These are $D_4$ and
$D_{11}$. In both cases, the first operation in the dependency modifies an entry and the second operation observes
that modification.
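
The derivation of the proscribed set can be replayed mechanically. The following Python sketch is only an illustration: the table `DEPENDENCIES` and the helper names are ours, not part of the report, and `can_cascade` is our reading of "the first operation modifies an entry and the second operation observes that modification". Each dependency is encoded as the class of the first operation, the class of the second, and a flag saying whether the two key strings are identical.

```python
# Operation classes: M = modify, L = observe (lookup), D = dump.
# Each dependency is (first op, second op, same_key); same_key is None
# when one of the operations is a Dump and therefore takes no key string.
DEPENDENCIES = {
    1: ("M", "M", False), 2: ("M", "M", True),
    3: ("M", "L", False), 4: ("M", "L", True),
    5: ("L", "L", False), 6: ("L", "L", True),
    7: ("L", "M", False), 8: ("L", "M", True),
    9: ("D", "M", None), 10: ("D", "L", None),
    11: ("M", "D", None), 12: ("L", "D", None),
    13: ("D", "D", None),
}

def is_insignificant(first, second, same_key):
    """Apply the two exclusion rules stated in the text."""
    neither_modifies = first != "M" and second != "M"
    different_keys = same_key is False
    return neither_modifies or different_keys

def can_cascade(first, second, same_key):
    """Assumed rule: the first operation modifies, the second observes that change."""
    return first == "M" and second in ("L", "D") and same_key is not False

proscribed = sorted(n for n, d in DEPENDENCIES.items() if not is_insignificant(*d))
cascading = sorted(n for n, d in DEPENDENCIES.items() if can_cascade(*d))
print(proscribed)  # [2, 4, 8, 9, 11]
print(cascading)   # [4, 11]
```

Under these assumptions the surviving dependencies are exactly the five that make up the interleaving specification, and the two cascade-prone ones are the modify-then-observe cases.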


4.2  FIFO  Queues 

Similar  specifications  can  be  developed  for  other  data  types.  The  FIFO  Queue  provides  an  interesting 
example.  We  will  only  consider  two  operations: 

• QEnter(queue, capa): Adds an entry containing the pointer capa to the end of queue. The undo
operation for QEnter removes this entry.

• QRemove(queue): Removes the entry at the head of queue and returns the pointer capa contained
therein. If queue is empty, the operation is blocked, and waits until queue becomes non-empty.
The undo operation for QRemove restores the entry to the head of queue.

In  order  to  permit  serialization  of  transactions  that  contain  operations  on  strict  FIFO  Queues,  and  to 
prevent  cascading  aborts,  numerous  properties  must  be  guaranteed.  For  instance: 

•  If  a  transaction  adds  several  entries  to  a  Queue,  these  entries  must  appear  together  and  in  the  same 
order  at  the  head  of  the  Queue. 

•  Any  entries  added  to  a  Queue  by  a  transaction  may  not  be  observed  by  another  transaction  unless 
the  first  transaction  terminates  successfully. 

•  If  two  transactions  each  make  entries  in  two  Queues,  the  relative  ordering  of  the  entries  made  by 
the  two  transactions  must  be  the  same  in  both  Queues. 

It is very easy to destroy these properties if unrestricted interleaving of operations is allowed. For instance,
if QEnter operations from different transactions are interleaved, the entries made by each transaction will not
appear in a block at the head of the Queue.

In defining the dependencies for the Queue type, it is necessary, as it was in the case of Directories, to
distinguish individual elements in the Queue. It is assumed that each element is assigned a unique identifier³
when it is entered on the Queue. The symbols $\sigma$ and $\sigma'$ are used to represent the distinct identifiers of
different elements, and the QEnter and QRemove operations are abbreviated as E and R respectively. The
complete set of dependencies for Queues is:

• $D_1$: $T_i:E(\sigma) \rightarrow_Q T_j:E(\sigma')$. $T_j$ enters an element $\sigma'$ into the queue Q after $T_i$ has previously
entered an element $\sigma$.

• $D_2$: $T_i:E(\sigma) \rightarrow_Q T_j:R(\sigma')$. $T_j$ removes element $\sigma'$ after $T_i$ entered element $\sigma$.

• $D_3$: $T_i:E(\sigma) \rightarrow_Q T_j:R(\sigma)$. $T_j$ removes the element $\sigma$ that was entered by $T_i$.

• $D_4$: $T_i:R(\sigma) \rightarrow_Q T_j:E(\sigma')$. $T_j$ enters element $\sigma'$ after $T_i$ removed element $\sigma$.

• $D_5$: $T_i:R(\sigma) \rightarrow_Q T_j:R(\sigma')$. $T_j$ removes element $\sigma'$ after $T_i$ removed element $\sigma$.

³ The identifier need not be globally unique, just unique among those generated for the particular Queue object.

In a Read/Write synchronization scheme, QEnter must be modeled as a Write operation, and QRemove
must be modeled as a Read followed by a Write. Recall that such a scheme must prevent cycles in the
$<_{R \rightarrow W \,\cup\, W \rightarrow R \,\cup\, W \rightarrow W}$ dependency relation. In this case, preventing cycles in this general dependency
relation is unnecessarily restrictive. Consider dependency $D_2$, which is formed when a transaction removes a
Queue element after another transaction has previously entered a different Queue element. Neither of the
transactions performing the operations can detect their ordering, nor can a third transaction. The same
applies to dependency $D_4$, which is the inverse of $D_2$. As was the case for Directories, concurrency can be
increased by disregarding insignificant dependencies.

To provide a strictly FIFO Queue, one must guarantee that abstract schedules are orderable with respect to
the compound $<_{D_1 \cup D_3 \cup D_5}$ relation, but cycles may be permitted to form in relations that include $D_2$ or
$D_4$ as long as this property is not violated. For example, consider the following schedule, in which two
transactions operate on a Queue that initially contains {A, B}:

    T1: QEnter(Q, X)
    T2: QRemove(Q) returns A
    T1: QEnter(Q, Y)

At step 2 of this schedule a $D_2$ dependency is formed, hence $T_1 <_{D_2} T_2$. At step 3, however, a $D_4$ dependency
is formed with $T_2 <_{D_4} T_1$. Clearly a cycle exists in the compound relation $<_{D_2 \cup D_4}$. It is easy to create other
examples of consistent abstract schedules that demonstrate a cycle in the basic $<_{D_2}$ (or $<_{D_4}$) relation, or in a
compound relation formed from $D_2$ (or $D_4$) together with $D_1$, $D_3$ and $D_5$.
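
The cycle in this example can be checked mechanically. The sketch below is an illustration under our own assumptions: it treats "orderable" as "the dependency graph restricted to the chosen relations is acyclic", and the helper names `classify`, `edges` and `has_cycle` are hypothetical. It classifies each ordered pair of operations from different transactions as one of the Queue dependencies $D_1$ to $D_5$ and then looks for cycles.

```python
from itertools import combinations

# Abstract schedule from the example: (transaction, operation, element).
schedule = [
    ("T1", "E", "X"),   # T1: QEnter(Q, X)
    ("T2", "R", "A"),   # T2: QRemove(Q) returns A
    ("T1", "E", "Y"),   # T1: QEnter(Q, Y)
]

def classify(first, second):
    """Return the Queue dependency number (1-5) formed by two operations."""
    (_, op1, elem1), (_, op2, elem2) = first, second
    if op1 == "E" and op2 == "E":
        return 1
    if op1 == "E" and op2 == "R":
        return 3 if elem1 == elem2 else 2
    if op1 == "R" and op2 == "E":
        return 4
    return 5  # R followed by R

def edges(schedule, relations):
    """Edges Ti -> Tj for dependencies in `relations` between different transactions."""
    result = set()
    for first, second in combinations(schedule, 2):  # pairs keep schedule order
        if first[0] != second[0] and classify(first, second) in relations:
            result.add((first[0], second[0]))
    return result

def has_cycle(edge_set):
    """Detect a cycle by repeatedly discarding nodes with no incoming edge."""
    edge_set = set(edge_set)
    nodes = {node for edge in edge_set for node in edge}
    while nodes:
        sources = {n for n in nodes if not any(dst == n for _, dst in edge_set)}
        if not sources:
            return True
        nodes -= sources
        edge_set = {(s, d) for s, d in edge_set if s not in sources}
    return False

print(has_cycle(edges(schedule, {2, 4})))     # True: T1 -> T2 -> T1
print(has_cycle(edges(schedule, {1, 3, 5})))  # False: the schedule is consistent
```

The schedule is therefore acceptable for a strict Queue even though a Read/Write scheme, which cannot tell $D_2$ and $D_4$ from the significant dependencies, would reject it.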

The dependency relations can also be used to characterize schedules susceptible to cascading abort.
Dependency relation $<_{D_1}$ is similar to the $W \rightarrow W$ dependency. Since entries made by an aborted transaction
can be transparently removed from the Queue, there is no danger of cascading abort. Relations $<_{D_3}$ and $<_{D_5}$
are more similar to $W \rightarrow R$ dependencies. In a $D_3$ dependency, information is transferred between the
transactions in the form of the queue element $\sigma$; this dependency clearly can cause cascading aborts. A $D_5$
dependency can also cause cascading aborts, because the removal of an element by the first transaction affects
which element is received by the second transaction.

While this definition of consistency for Queues is an improvement over a Read/Write scheme, it is still very
restrictive of concurrency. It allows at most two transactions, one performing QEnter operations and one
performing QRemove operations, to access a Queue concurrently. Unlike the Directory, the Queue is
intended to preserve a particular ordering of the elements contained in it. A system based on serializable
transactions guarantees that transactions can be placed in some order; by enforcing a particular order, data
types such as queues (and stacks) restrict concurrency.




4.3  Queues  Allowing  Greater  Concurrency 

The preceding examples show how the use of semantic knowledge about operations on a shared abstract
type permits increased concurrency. Once such knowledge is incorporated, the limiting factor in permitting
concurrency becomes knowledge about the consistency constraints that the operations in a transaction
attempt to maintain [Kung 79]. This knowledge concerns the semantics of groups of operations rather than
individual ones. For example, a consistency constraint might state that every Queue entry of type A is
immediately followed by one of type B. The potential for such constraints was the cause of the concurrency
limitations observed above.

If it is possible to restrict the consistency constraints that a programmer is free to require, types
guaranteeing ordering properties weaker than serializability may be acceptable. This may permit further
increases in concurrency. A variation of the queue type can be used to demonstrate this.

One of the most common uses for a queue is to provide a buffer between activities that produce and
consume work. Frequently, the exact ordering of entries on the queue is not important. What is crucial is
that entries put on the rear of the queue do not languish in the queue forever; they should reach the head of
the queue "fairly" with respect to other entries made at about the same time. A data type having this
non-starvation property can be defined: the Weakly-FIFO Queue (WQueue for short). A similar type, the
Semi-Queue, has been defined by Weihl [Weihl 83b].

The operations on WQueues and their corresponding undo operations are similar to those for Queues, but
the interleaving specification for WQueues allows more concurrency. The dependencies for the WQueue
type are the same as for the strict Queue. However, where the strict Queue required that consistent abstract
schedules be orderable with respect to $\{<_{D_1 \cup D_3 \cup D_5}\}$, the WQueue permits cycles to occur in all the
dependency relations save one: $<_{D_3}$. By allowing cycles in $<_{D_1}$, the interleaving of entries by multiple
transactions becomes possible. Similarly, removing $D_5$ from the set of proscribed dependency relations
permits WQRemove operations to be interleaved.

To take full advantage of the greater concurrency allowed by this interleaving specification, the semantics of
WQRemove differ slightly from those of QRemove. If the transaction that inserted the headmost entry in the
queue has not committed, that entry cannot be removed without risking the possibility of a cascading abort.
Instead, WQRemove scans the WQueue and removes the headmost entry for which the inserting transaction
has committed. If no such element can be found, any elements inserted by the transaction doing the
WQRemove become eligible for removal. If neither a committed entry nor one inserted by the same
transaction is available, the operation is blocked until an inserting transaction commits.
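
The selection rule just described can be written as a small pure function. This is a sketch under stated assumptions, not the report's implementation: the name `select_removable` and the data layout are hypothetical, the list is assumed to hold only entries that no other transaction has already removed, and where the text says the remover's own entries "become eligible" we assume the headmost of them is taken.

```python
def select_removable(entries, committed, remover):
    """Return the index of the entry WQRemove would take, or None if it must block.

    entries   -- list of (value, inserting_transaction) pairs, head first
    committed -- set of transactions that have committed
    remover   -- the transaction invoking WQRemove
    """
    # First choice: the headmost entry whose inserting transaction has committed.
    for index, (_, inserter) in enumerate(entries):
        if inserter in committed:
            return index
    # Fallback: an entry inserted by the removing transaction itself
    # (assumption: the headmost such entry).
    for index, (_, inserter) in enumerate(entries):
        if inserter == remover:
            return index
    # Nothing eligible: block until some inserting transaction commits.
    return None

entries = [("a", "T5"), ("b", "T6")]
print(select_removable(entries, {"T6"}, "T7"))  # 1: skips the uncommitted head
print(select_removable(entries, set(), "T5"))   # 0: T5 may take its own entry
print(select_removable(entries, set(), "T7"))   # None: T7 would block
```

Skipping an uncommitted head entry is what lets a remover proceed without forming a cascade-prone dependency on the transaction that inserted it.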

Modifying the semantics of WQRemove in this way does not destroy the fairness properties of the WQueue.
No entry will remain in the WQueue forever if:

1. The transaction that entered it commits in a finite amount of time.

2. Transactions that remove it terminate after a finite amount of time.

3. Only a finite number of transactions remove the entry and then abort.


The behavior of the WQueue is best illustrated by example. In what follows, a WQueue is represented by a
sequence of letters, with the left end of the sequence being the head of the WQueue. Lower case italic letters
($a$) are used to denote entries for which the WQEnter operation has not committed (i.e., the transaction that
performed WQEnter is incomplete). Upper case bold letters ($\mathbf{A}$) are used to represent entries that have not
been removed and for which the entering transaction has committed. Upper case italic letters are used for
entries that have been removed by an uncommitted WQRemove. Superscripts on entries affected by
uncommitted operations identify the transaction that performed the operation.


Assume that the WQueue is initially empty. If transactions $T_1$ and $T_2$ perform WQEnter(WQ, a) and
WQEnter(WQ, b) respectively, the WQueue's state becomes:

    $\{a^1, b^2\}$

Since cycles in $<_{D_1}$ are permitted, $T_1$ may also add another entry, yielding:

    $\{a^1, b^2, c^1\}$

If $T_1$ and $T_2$ both commit, the state becomes:

    $\{\mathbf{A}, \mathbf{B}, \mathbf{C}\}$

Note that the serializability of $T_1$ and $T_2$ has not been preserved. Now suppose that $T_3$ performs WQRemove
and another transaction, $T_4$, removes two more elements:

    $\{A^3, B^4, C^4\}$

If $T_3$ now aborts and $T_4$ commits, the final state becomes:

    $\{\mathbf{A}\}$

In this case, A and C have effectively been reversed, even though they were inserted initially by the same
transaction! This example illustrates an important difference between shared abstract types that attempt to
preserve serializability and those that do not: when a type permits non-serial execution of transactions,
invoking an operation and subsequently aborting it is not necessarily equivalent to not invoking the operation
at all. While we do not explicitly consider the undo operations in defining dependencies or interleaving
specifications, the underlying assumption that aborts can occur at any time prior to commit implies that undo
operations can be inserted at any point in a schedule between the invocation of an operation and the time at
which the invoking transaction commits.
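
The reversal of A and C can be reproduced with a toy model. The class below is a sketch of our own making (its name, methods and data layout are assumptions) that keeps only what this example needs: it omits blocking and the rule for a transaction's own uncommitted entries, and its undo restores a removed entry to its original position, which here happens to be the head.

```python
class ToyWQueue:
    """Toy model: each entry records its inserter and any pending remover."""

    def __init__(self):
        self.entries = []       # dicts, head first
        self.committed = set()  # transactions that have committed

    def enter(self, tid, value):
        self.entries.append({"value": value, "by": tid, "removed_by": None})

    def remove(self, tid):
        # Take the headmost entry that is committed and not already removed.
        for entry in self.entries:
            if entry["removed_by"] is None and entry["by"] in self.committed:
                entry["removed_by"] = tid
                return entry["value"]
        raise RuntimeError("WQRemove would block")

    def commit(self, tid):
        self.committed.add(tid)
        # Removals performed by tid become permanent.
        self.entries = [e for e in self.entries if e["removed_by"] != tid]

    def abort(self, tid):
        # Undo: drop tid's insertions and restore the entries tid removed.
        self.entries = [e for e in self.entries if e["by"] != tid]
        for entry in self.entries:
            if entry["removed_by"] == tid:
                entry["removed_by"] = None

    def visible(self):
        return [e["value"] for e in self.entries if e["removed_by"] is None]

q = ToyWQueue()
q.enter("T1", "A"); q.enter("T2", "B"); q.enter("T1", "C")
q.commit("T1"); q.commit("T2")
first = q.remove("T3")                      # T3 takes A
rest = [q.remove("T4"), q.remove("T4")]     # T4 takes B, then C
q.abort("T3"); q.commit("T4")
print(first, rest, q.visible())  # A ['B', 'C'] ['A']
```

C has been consumed while A, entered earlier by the same transaction, is still waiting: aborting the remove of A was not equivalent to never invoking it.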


Another example indicates what happens when an uncommitted entry reaches the head of the Queue.
Suppose the initial state is:

    $\{a^5, b^6\}$

If $T_6$ commits but $T_5$ remains incomplete, the state becomes:

    $\{a^5, \mathbf{B}\}$

If $T_7$ removes an element at this time, B will be returned, leaving:

    $\{a^5\}$

after $T_7$ commits. On the other hand, if $T_5$ commits after $T_6$, but before the remove by $T_7$, A will be returned
even though its insertion was committed after B's.

To summarize the comparison between the WQueue and the ordinary Queue, note that two properties of
the regular Queue have been sacrificed. First, strict FIFO ordering of entries is not guaranteed, because
aborting WQRemove operations can reorder them. Second, transactions that operate on WQueues are not
necessarily serializable with respect to all transactions in the system. Some other crucial properties, however,
are preserved. The WQueue will not starve any entry, and it enforces an ordering of those transactions that
communicate through access to a common element of the queue. This is ensured by orderability with respect
to $<_{D_3}$ [...] many situations.


4.4  Proving  the  Correctness  of  Type  Implementations 
Whereas  the  user  of  a  type  may  employ  the  specified  properties  of  abstract  schedules  (along  with  the  rest  of 
the  type’s  specification)  to  reason  about  the  correctness  of  transactions,  the  implementor  of  a  type  must  prove 
the  correctness  of  an  implementation  given  the  order  in  which  operations  are  actually  invoked.  Real 
implementations  may  reorder  the  operations  on  an  object  to  improve  concurrency  without  changing  the  type's 
interleaving  specification.  Consider  an  implementation  of  the  Queue  type  in  which  elements  to  be  entered  by 
a  transaction  are  first  collected  in  a  transaction-local  cache  and  entered  as  a  block  at  end-of-transaction.  This 
implementation  allows  any  number  of  transactions  to  invoke  the  QEnter  operation  simultaneously,  provided 
care  is  taken  to  serialize  correctly  transactions  involving  multiple  Queues.  By  actually  performing  the 
insertions  as  a  block,  this  implementation  effectively  reorders  the  individual  QEnter  operations  to  preserve 
consistency.  It  is  possible  to  reorder  QEnter  operations  in  this  way  because  QEnter  does  not  return  any 
information  to  its  caller.  Formation  of  any  dependencies  that  might  result  from  its  invocation  can  therefore 
be  postponed.  The  ultimate  ordering  of  operations  in  the  abstract  schedule  is  determined  by  the 
implementation once all the QEnter operations to be performed by a given transaction are known. Thus, this
implementation has the benefit of more knowledge about transactions than has the standard implementation.


Invocation schedules list operations in the order in which they are actually invoked, rather than in order of
their abstract effects⁴. For example, the following is a possible invocation schedule for a Queue implemented
using the block-insertion technique described above:

    T2: QEnter(Q, Y)
    T1: QEnter(Q, X)
    T3: QRemove(Q)

If $T_1$ commits before $T_2$, the implementation reorders the two QEnter operations, resulting in the abstract
schedule:

    T1: QEnter(Q, X)
    T2: QEnter(Q, Y)
    T3: QRemove(Q)

The mapping between invocation schedules and abstract schedules is many-one; each invocation schedule
implements exactly one abstract schedule, but an abstract schedule may be implemented by multiple
invocation schedules. The synchronization mechanism used by an implementation determines a set of
invocation schedules, called legal schedules, that are permitted by the implementation. The implementor
must show that all legal invocation schedules map to consistent abstract schedules. To prevent cascading
aborts as well, implementors must use a synchronization strategy that restricts the set of legal invocation
schedules to those that map to abstract schedules that are in the intersection of the consistent and cascade-free
sets.
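
The block-insertion technique and the resulting mapping from an invocation schedule to an abstract schedule can be sketched as follows. This is an illustration only: the class `BlockInsertQueue` and its fields are hypothetical, it covers a single Queue (so the care needed to serialize transactions across multiple Queues is not modeled), and removing from an empty Queue raises an error here instead of blocking.

```python
from collections import deque

class BlockInsertQueue:
    """Queue whose QEnter operations take effect as a block at commit time."""

    def __init__(self):
        self.items = deque()  # committed contents, head at the left
        self.pending = {}     # transaction id -> entries cached locally
        self.abstract = []    # abstract schedule, in order of effect

    def enter(self, tid, value):
        # Invocation time: only the transaction-local cache changes.
        self.pending.setdefault(tid, []).append(value)

    def commit(self, tid):
        # End of transaction: the cached entries are inserted as one block.
        for value in self.pending.pop(tid, []):
            self.items.append(value)
            self.abstract.append((tid, "QEnter", value))

    def remove(self, tid):
        value = self.items.popleft()  # raises IndexError if empty
        self.abstract.append((tid, "QRemove", value))
        return value

q = BlockInsertQueue()
q.enter("T2", "Y")   # invocation schedule: T2 enters first ...
q.enter("T1", "X")   # ... then T1
q.commit("T1")       # but T1 commits before T2
q.commit("T2")
removed = q.remove("T3")
print(removed)       # X
print(q.abstract)
```

Because QEnter returns nothing to its caller, deferring its effect to commit time is invisible to the invoking transaction, which is what makes the reordering legitimate.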

5  Orderability  of  Groups  of  Transactions 

The  preceding  section  described  how  the  standard  specification  of  an  abstract  type,  which  only  seeks  to 
characterize  the  type's  invariants  and  the  postconditions  for  its  operations,  can  be  augmented  with  an 
interleaving  specification  that  describes  the  local  synchronization  properties  of  objects.  In  this  section  we 
broaden our focus from the properties of the typed objects that are manipulated by transactions to the
properties  of  entire  transactions.  We  first  examine  how  to  generalize  the  definition  of  consistent  abstract 
schedules  to  schedules  that  include  operations  on  more  than  one  object  type,  and  then  consider  how  ordering 
properties  of  groups  of  transactions  can  be  used  to  show  their  correctness. 


⁴ It is assumed that the actual concurrent execution of the transactions can be modeled by a linear ordering of their component
operations. This requires that the primitive operations be (abstractly) atomic. In the multiprocessor case, all linearizations of operations
that could occur simultaneously yield distinct invocation schedules.


5.1 How the Specifications of Multiple Types Interact


Guaranteeing orderability with respect to the proscribed relations of a collection of individual types is not
sufficient to ensure global ordering properties of transactions, such as serializability. Consider the following
schedule, which contains transactions that operate both on Queues and Directories. Each of these types
preserves orderability with respect to the union of all significant dependencies for the individual type, in
order that transactions involving the type may potentially be serialized. However, this property alone does
not guarantee serializability of the transactions. For example, the following schedule is not serializable:

    T1: QEnter(Q, X)
    T2: QEnter(Q, Y)
    T2: DirInsert(D, "A", Z)
    T1: DirDelete(D, "A")

Let $<_{Dir}$ stand for the $<_{D_2 \cup D_4 \cup D_8 \cup D_9 \cup D_{11}}$ relation, defined earlier for type Directory. Let $<_Q$ stand for
the $<_{D_1 \cup D_3 \cup D_5}$ relation, defined earlier for Queues. Although the schedule is orderable with respect to
$\{<_{Dir}, <_Q\}$, it is not serializable. To achieve serializability, the Queue and Directory types must cooperate to
prevent cycles in the relation $<_{Dir \cup Q}$. The schedule is not orderable with respect to this compound
dependency.


This example indicates how to generalize the definition of consistency to apply to abstract schedules
containing operations on multiple types. Assume the interleaving specification for type $Y_1$ guarantees
orderability with respect to $\{<_{D_{Y_1}}\}$, the interleaving specification for type $Y_2$ guarantees orderability with
respect to $\{<_{D_{Y_2}}\}$, etc. The set of consistent abstract schedules involving types $Y_1, Y_2, \ldots, Y_n$ is defined as those
abstract schedules that are orderable with respect to $\{<_{D_{Y_1} \cup D_{Y_2} \cup \cdots \cup D_{Y_n}}\}$: the union of the proscribed
dependency relations of the individual types. A set of types whose implementations satisfy this property is
called a set of cooperative types.
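
The difference between per-type orderability and orderability with respect to the union can be seen on the Queue and Directory schedule above. In this sketch the dependency edges are written out by hand from that schedule, "orderable" is taken to mean "acyclic", and the function name `orderable` is our own.

```python
def orderable(edges):
    """True if the transactions can be ordered consistently with `edges`.

    edges -- set of (before, after) pairs of transaction names
    """
    remaining = set(edges)
    while remaining:
        targets = {after for _, after in remaining}
        # An edge is free when its source has no remaining predecessor.
        free = {(b, a) for b, a in remaining if b not in targets}
        if not free:
            return False  # every remaining source has a predecessor: a cycle
        remaining -= free
    return True

# Queue Q: T1 entered X before T2 entered Y (a D1 dependency).
queue_edges = {("T1", "T2")}
# Directory D: T2 inserted "A" before T1 deleted it (a D2 dependency).
directory_edges = {("T2", "T1")}

print(orderable(queue_edges))                    # True
print(orderable(directory_edges))                # True
print(orderable(queue_edges | directory_edges))  # False
```

Each type manager, looking only at its own object, sees an acceptable schedule; the cycle appears only in the union, which is why the types must cooperate.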

The  need  for  cooperation  among  types  does  not  necessarily  imply  that  whenever  a  system  is  extended  by 
the  definition  of  a  new  type,  the  synchronization  requirements  of  all  existing  types  must  be  rethought.  When 
designing  a  system,  however,  the  implementors  of  cooperative  types  must  first  agree  on  a  synchronization 
mechanism  that  is  sufficiently  flexible  and  powerful  to  meet  all  of  their  requirements.  A  poor  choice  of 
mechanism  for  fundamental  building-block  types  will  have  an  adverse  effect  on  the  entire  system.  Section 
6  describes  a  mechanism  based  on  locking  that  permits  highly  concurrent  implementations  of  a  large  variety 
of  shared  abstract  types. 

5.2 Correctness of Transactions

When all of the types involved in a group of transactions cooperate to preserve an ordering property
equivalent to serializability, it is easy to show that the correctness of transactions is not affected by
concurrency. Because transactions are completely isolated from one another, a transaction can be proven
correct solely on the basis of its own code and the assumption that the system state is correct when the
transaction is initiated.

It is much more difficult to prove the correctness of transactions when they include operations on types that
permit non-serializable interaction among transactions. One must consider the possible effects of interleaving
each transaction with any other transaction, subject to the constraints of whatever ordering property is
guaranteed by the collection of types. Nevertheless, in many practical situations, this task should not be
insurmountable. We give two examples of situations where it is possible to make useful inferences about the
behavior of transactions even though they preserve an ordering property weaker than serializability.

Users often invoke the DirDump operation on a Directory when they are "just looking around." In such
cases, users would like to see a snapshot of the Directory's contents at an instant when the status of each entry
is well defined, but they don't care what happens to the Directory thereafter. If all Directory operations
attempt to enforce serializability, using DirDump in this way could greatly restrict concurrency. This problem
can be alleviated by modifying the specification of the Directory type to permit limited non-serializable
behavior.


Suppose dependency relations containing $D_9$: $T_i:D \rightarrow T_j:M(\sigma)$ are removed from the set of proscribed
relations for the modified Directory type. That is, the interleaving specification for Directories only requires
orderability with respect to $\{<_{D_2 \cup D_4 \cup D_8 \cup D_{11}}\}$ instead of $\{<_{D_2 \cup D_4 \cup D_8 \cup D_9 \cup D_{11}}\}$. Although the modified
Directory allows non-serializable behavior, one can still guarantee that certain consistency constraints are not
violated. For example, if a transaction replaces a group of entries in a Directory, one can still prove that no
other transaction doing DirLookup operations will observe an incompatible collection of entries.


The WQueue of Section 4.3 provides another example of a useful type that permits non-serializable
interaction of transactions. Although the ordering property for WQueues is weaker than the one for strict
Queues, some interesting properties can still be deduced based only on orderability with respect to $<_{D_3}$.

Consider two transactions, $T_1$ and $T_2$, and two WQueues, $Q_1$ and $Q_2$. Suppose $T_1$ is intended to move all
elements from $Q_1$ to $Q_2$ and $T_2$ is intended to move all elements from $Q_2$ to $Q_1$. If these transactions are run
concurrently, the elements should all wind up in one WQueue or the other. This can be guaranteed only if
$<_{D_3}$ is proscribed; otherwise elements could be shuffled endlessly between $Q_1$ and $Q_2$ and the transactions
might never terminate.

6 A Technique for Synchronizing Shared Abstract Types

We have developed a formalism for specifying the synchronization of operations on shared abstract types,
and interleaving specifications for some example types have been given. This section outlines a
synchronization mechanism that can be used in implementations of these types. While we do not describe a
particular syntax or implementation for this mechanism, we show how it can be used to prevent cascading
aborts and control the interleaving of operations. We show how it provides the cooperation among types that
is needed to preserve serializability or a weaker ordering property of a group of transactions. Implementation
sketches for the shared abstract types specified in Section 4 are given as examples of its use.

As indicated in Section 4.4, the implementor of a type must take the following steps to demonstrate the
correctness of an implementation:

1. characterize the set of legal invocation schedules, that is, those invocation schedules allowed by
the synchronization mechanism used in the implementation.

2. give a mapping from invocation schedules to abstract schedules, and prove that the
implementation carries out this mapping.

3. prove that every legal invocation schedule yields a consistent abstract schedule under this
mapping.

This three-part task is simplest for implementations that are idealized in that they do not reorder operations
on  objects.  Under  these  conditions,  invocation  schedules  and  abstract  schedules  are  equivalent,  and  the 
second  step  in  this  process  can  be  eliminated.  The  examples  in  this  section  discuss  such  idealized 
implementations  of  types. 

6.1  Type-Specific  Locking 

The  proposed  synchronization  technique  is  based  on  locking,  which  is  used  in  many  database  systems  to 
synchronize  access  to  database  objects.  There  are  many  variations  on  locking,  but  the  same  basic  principle 
underlies  them  all:  before  a  transaction  is  permitted  to  manipulate  an  object,  it  must  obtain  a  lock  on  the 
object  that  will  restrict  further  access  to  the  object  by  other  transactions  until  the  transaction  holding  the  lock 
releases it.

Locking  restricts  the  formation  of  dependencies  between  transactions  by  restricting  the  set  of  legal 
invocation  schedules.  Whenever  one  transaction  is  forced  to  wait  for  a  lock  held  by  another,  the  formation  of 
a  dependency  between  the  two  transactions  is  delayed  until  the  first  transaction  releases  the  lock.  Under  the 
well-known two-phase locking protocol [Eswaran 76], no transaction releases a lock until it has already claimed
all the locks it will ever claim. This has the effect of converting potential cycles in dependency relations into
deadlocks instead. These can be detected, and because no dependencies have yet been allowed to form, either
transaction can be aborted without affecting the other.
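
How locking turns a would-be dependency cycle into a detectable deadlock can be sketched with a minimal lock table. Everything here is an assumption made for illustration: the class `LockManager`, its single exclusive lock per object, and the waits-for chain used for detection are ours, and all locks are held until end-of-transaction, which is one simple way of satisfying the two-phase rule.

```python
class LockManager:
    """Exclusive locks held to end-of-transaction, with deadlock detection."""

    def __init__(self):
        self.holder = {}     # object -> transaction holding its lock
        self.waits_for = {}  # waiting transaction -> transaction it waits on

    def acquire(self, tid, obj):
        """Return 'granted', 'wait', or 'deadlock'."""
        owner = self.holder.get(obj)
        if owner is None or owner == tid:
            self.holder[obj] = tid
            return "granted"
        # Follow the waits-for chain from the owner; reaching tid closes a cycle.
        seen, current = set(), owner
        while current is not None and current not in seen:
            if current == tid:
                return "deadlock"
            seen.add(current)
            current = self.waits_for.get(current)
        self.waits_for[tid] = owner
        return "wait"

    def release_all(self, tid):
        """Release every lock of tid at commit or abort."""
        self.holder = {o: t for o, t in self.holder.items() if t != tid}
        self.waits_for.pop(tid, None)
        self.waits_for = {w: t for w, t in self.waits_for.items() if t != tid}

lm = LockManager()
results = [
    lm.acquire("T1", "Q"),  # T1 locks the Queue
    lm.acquire("T2", "D"),  # T2 locks the Directory
    lm.acquire("T2", "Q"),  # T2 must wait for T1
    lm.acquire("T1", "D"),  # T1 would wait for T2: a deadlock
]
print(results)  # ['granted', 'granted', 'wait', 'deadlock']
lm.release_all("T1")        # abort T1; no dependency has formed yet
retry = lm.acquire("T2", "Q")
print(retry)    # granted
```

Since neither transaction has operated on the object it is waiting for, aborting T1 leaves T2 unaffected, and T2 obtains the lock when it retries.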

Locking  is  a  conservative  policy,  because  it  delays  the  formation  of  any  dependency  that  is  part  of  a 
proscribed  relation,  not  just  those  that  eventually  lead  to  cycles.  This  is  not  as  significant  a  disadvantage  as  it 
might  appear,  however,  because  formation  of  those  dependencies  that  transfer  information  (see  Section  3.3) 
must  be  delayed  anyway  to  prevent  cascading  aborts.  In  fact,  the  even  more  restrictive  strategy  of  holding 
certain locks until end-of-transaction must often be employed to ensure that schedules are cascade-free.
Furthermore, it is the conservative nature of locking protocols that makes them a suitable mechanism for sets
of cooperative types. By preventing the formation of any dependencies local to a single object, cycles in
proscribed relations that involve multiple types are automatically avoided without explicit communication
between  type  managers.  This  is  an  important  advantage,  because  it  allows  type  managers  to  be  constructed 
independently,  as  long  as  they  correctly  prevent  the  local  formation  of  dependencies. 

The chief disadvantage of many locking mechanisms is that they sacrifice concurrency by making minimal
use of semantic knowledge about the objects being manipulated. The simplest locking schemes use only one
type of lock, and hence cannot distinguish between significant and insignificant dependencies. Read/Write
locking schemes use some semantic information, but are not flexible enough to take advantage of the extra
concurrency specifiable in terms of type-specific dependencies. It has been shown [Kung 79] that two-phase
locking is optimal under such conditions of limited semantic knowledge, but much more concurrency can be
obtained if more semantic information is used. The locking technique described here generalizes the ideas
behind Read/Write locking. It permits the definition of type-specific locking rules that reflect the
interleaving specifications of individual data types. More restrictive type-specific locking schemes have
previously been investigated by Korth [Korth 83].

Two observations can be made concerning type-specific dependencies. First, they specify the way in which
type-specific operations on behalf of different transactions may be interleaved. Analogously, the generalized
locking scheme requires the definition of type-specific lock classes, which correspond roughly to the
operations on the type. Second, in addition to the operations, the dependencies reflect data supplied to the
operations as arguments or data that is otherwise specific to the particular object acted upon. Therefore, an
instance of a lock in the generalized locking scheme consists of two parts: the type-specific lock class and
some amount of instance-specific data. It is the inclusion of data in the lock instance that differentiates our
technique from Korth's. We use the notation {LockClass(data)} to represent an instance of a lock.

Once the lock classes for a type have been defined, a Boolean function must be given that specifies whether
a particular new lock request may be granted as a function of those locks already held on the object. In
accordance with the practice in database literature, this function will be represented by a lock compatibility
table. Only those locks held by other transactions need be checked for compatibility; a new lock request is
always compatible with other locks held by the same transaction.
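The structure just described (a lock instance made of a lock class plus instance-specific data, and a Boolean compatibility function consulted only for locks held by other transactions) can be sketched as follows. This is a minimal illustration under stated assumptions: the names Lock, may_grant, and read_write are hypothetical, and the Read/Write rule is included only as a familiar special case of a compatibility function.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Optional


@dataclass(frozen=True)
class Lock:
    owner: str                         # transaction holding or requesting the lock
    lock_class: str                    # type-specific lock class
    data: Optional[Hashable] = None    # instance-specific data, e.g. a key string


# A compatibility function answers: may `requested` coexist with `held`?
Compat = Callable[[Lock, Lock], bool]


def may_grant(requested: Lock, held: Iterable[Lock], compatible: Compat) -> bool:
    """Grant only if every lock held by another transaction is compatible."""
    return all(h.owner == requested.owner or compatible(requested, h) for h in held)


def read_write(requested: Lock, held: Lock) -> bool:
    """Classic Read/Write rule: only two Read locks are compatible."""
    return requested.lock_class == "Read" and held.lock_class == "Read"


held_locks = [Lock("T1", "Read", "x")]
```

With these definitions, a second reader is admitted, a writer from another transaction is refused, and a request from T1 itself is always granted, matching the rule that a transaction never conflicts with its own locks.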

To complete the description of a type's locking scheme, one must specify the protocol by which each of the
type's operations acquires and releases locks. Although two-phase locking can be used with type-specific
locks, the locking protocol may also be type-specific. A uniform two-phase protocol is simplest to understand,
but the added flexibility of type-specific protocols can allow increased concurrency. The exact nature of a
type-specific protocol depends not only on the semantics of the type, but also on the particular representation
and implementation chosen.

6.2  Directories 

A simple idealized implementation of the Directory type specified in Section 4.1 illustrates the basics of
type-specific locking. In this example, it is assumed that the Directory operations have been implemented in
a straightforward fashion with no attempt at internal concurrency. It is further assumed that the operations
act under the protection of a monitor or other mutual exclusion mechanism during the actual manipulation of
Directory objects. Locking is used exclusively to control the sequencing of Directory operations on behalf of
multiple transactions. The locking and mutual exclusion mechanisms cannot be completely independent,
however, because mutual exclusion must be released when waiting for a lock within the monitor. This is a
standard technique in systems that use monitors for synchronization [Hoare 74].

Because  the  mapping  from  invocation  schedules  to  abstract  schedules  is  trivial  for  this  implementation,  the 
second  step  of  the  validation  process  is  eliminated.  The  discussion  of  the  locking  scheme  for  Directories 
therefore  focuses  on  the  first  and  third  steps:  informal  characterization  of  the  set  of  legal  schedules,  and 
comparison  of  this  set  with  the  set  of  consistent  schedules. 

As  was  noted  in  Section  4.1,  the  operations  for  the  Directory  data  type  can  be  divided  into  three  groups: 

•  Modify operations, that alter the particular Directory entry identified by the key string σ.

•  Lookup operations, that observe the presence, absence, or contents of the particular Directory
entry identified by the key string σ.

•  Dump operations, that observe properties of the Directory that cannot be isolated to an individual
entry.

Corresponding  to  these  groups,  three  lock  classes  can  be  defined: 

•  {DirModify(σ)}: To indicate that an incomplete transaction has inserted or deleted an entry with
key string σ.

•  {DirLookup(σ)}: To indicate that an incomplete transaction has attempted to observe the entry
with key string σ.

•  {DirDump}: To indicate that an incomplete transaction has performed a DirDump of the entire
directory.


The lock compatibility table for Directories can be found in Table 1. Since there are a potentially infinite
number of strings, the symbols σ and σ' are used to represent two arbitrary non-identical strings.


                                    Lock Held

Lock Requested      DirModify(σ)    DirLookup(σ)    DirDump

DirModify(σ)        No              No              No
DirModify(σ')       OK              OK              No
DirLookup(σ)        No              OK              OK
DirLookup(σ')       OK              OK              OK
DirDump             No              OK              OK

Table 1: Lock Compatibility Table for Directories


Each entry in this table reflects the nature of one of the type-specific dependency relations for Directories.
Compatible entries represent dependency relations in which cycles are allowed to occur: for example, the
entry in row 2, column 2 is "OK" because cycles are permitted in the M(σ) → M(σ') dependency relation.
Incompatible entries reflect proscribed relations, such as the entry in row 1, column 2, which is due to the
proscribed M(σ) → M(σ) relation.
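Table 1 can also be read as a function of the two lock classes and their key strings. The following sketch is one possible encoding, not the report's implementation; the function name and the string tags used for lock classes are assumptions made for illustration, and a key of None stands for the data-free {DirDump} lock.

```python
from typing import Optional


def dir_compatible(req_class: str, req_key: Optional[str],
                   held_class: str, held_key: Optional[str]) -> bool:
    """Table 1 as a function: True means "OK", False means "No"."""
    same_key = req_key is not None and req_key == held_key
    if req_class == "DirModify":
        # Conflicts with any lock on the same key and with every DirDump.
        return held_class != "DirDump" and not same_key
    if req_class == "DirLookup":
        # Conflicts only with a DirModify on the same key.
        return not (held_class == "DirModify" and same_key)
    if req_class == "DirDump":
        # Conflicts with every DirModify, whatever its key.
        return held_class != "DirModify"
    raise ValueError(f"unknown lock class: {req_class}")
```

Writing the table this way makes the role of the instance-specific data visible: the key string matters only when one of the two locks is a DirModify.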

The  protocol  used  by  the  Directory  operations  for  acquiring  and  releasing  locks  is  as  follows: 

•  DirInsert or DirDelete operations that specify the key string σ obtain a {DirModify(σ)} lock on
the Directory. If the operation succeeds, the lock is held until end-of-transaction. If the operation
fails, the lock is converted to a {DirLookup(σ)} lock, which is held until end-of-transaction.

•  DirLookup operations that specify the key string σ obtain a {DirLookup(σ)} lock on the Directory
that is held until end-of-transaction.

•  DirDump operations obtain a {DirDump} lock on the Directory that is held until
end-of-transaction.


The following example demonstrates how the components of the locking scheme interact. Suppose a
Directory D is initially empty. If a transaction T1 performs the operation DirDelete(D, "Zebra"), this
operation will fail by returning not found and leave a {DirLookup("Zebra")} lock on the Directory until the
termination of T1. Now suppose a second transaction, T2, performs the operation
DirInsert(D, "Zebra", capa). According to the protocol, DirInsert must first obtain a {DirModify("Zebra")}
lock. Because the dependency relation L(σ) → M(σ) is proscribed, this lock is incompatible with the
{DirLookup("Zebra")} lock already held by T1 (see row 1, column 3 of the compatibility table). Therefore,
T2 will be blocked. If T1 subsequently becomes blocked while attempting to access an object already locked
by T2, a deadlock will occur. Both transactions are then blocked attempting to form dependencies that are
part of proscribed relations. Although these relations may involve different objects, or even different types, a
cycle in the union of the two relations is effectively prevented. This is exactly the behavior required to
achieve consistency among cooperative types. On the other hand, if T1 completes successfully the lock is
released and the dependency of T2 on T1 is permitted to form. Since the L(σ) → M(σ) dependency cannot
lead to cascading aborts, one may conclude (after the fact) that delaying T2 was unnecessary.

By contrast, a transaction T3 that performs the operation DirInsert(D, "Giraffe", capa) need not be blocked,
because the dependency relation L(σ) → M(σ') is not proscribed. Accordingly, row 2, column 3 of the
compatibility table indicates that a {DirModify("Giraffe")} lock is compatible with a {DirLookup("Zebra")}
lock.
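The Zebra and Giraffe example can be traced with a toy model of the Directory protocol. This is only a sketch under several assumptions: the class and method names are hypothetical, an operation that would have to wait returns "blocked" instead of suspending, an insert of an existing key is assumed to fail with "duplicate", and only successful completion is modeled (there is no undo on abort).

```python
from typing import Dict, List, Optional, Tuple

HeldLock = Tuple[str, str, Optional[str]]  # (transaction, lock class, key)


def compatible(req: str, rkey: Optional[str], held: str, hkey: Optional[str]) -> bool:
    """Condensed form of Table 1: may a requested lock coexist with a held one?"""
    same = rkey is not None and rkey == hkey
    if req == "DirModify":
        return held != "DirDump" and not same
    if req == "DirLookup":
        return not (held == "DirModify" and same)
    return held != "DirModify"  # req == "DirDump"


class LockedDirectory:
    """Toy Directory; operations return "blocked" instead of actually waiting."""

    def __init__(self) -> None:
        self.entries: Dict[str, object] = {}
        self.locks: List[HeldLock] = []

    def _acquire(self, txn: str, cls: str, key: Optional[str]) -> bool:
        # Only locks held by other transactions are checked.
        for owner, held_cls, held_key in self.locks:
            if owner != txn and not compatible(cls, key, held_cls, held_key):
                return False
        self.locks.append((txn, cls, key))
        return True

    def _downgrade(self, txn: str, key: str) -> None:
        # A failed modify converts its DirModify lock into a DirLookup lock.
        self.locks.remove((txn, "DirModify", key))
        self.locks.append((txn, "DirLookup", key))

    def insert(self, txn: str, key: str, value: object) -> str:
        if not self._acquire(txn, "DirModify", key):
            return "blocked"
        if key in self.entries:
            self._downgrade(txn, key)
            return "duplicate"
        self.entries[key] = value
        return "ok"

    def delete(self, txn: str, key: str) -> str:
        if not self._acquire(txn, "DirModify", key):
            return "blocked"
        if key not in self.entries:
            self._downgrade(txn, key)
            return "not found"
        del self.entries[key]
        return "ok"

    def end_transaction(self, txn: str) -> None:
        # Successful completion: all of the transaction's locks are released.
        self.locks = [lock for lock in self.locks if lock[0] != txn]


d = LockedDirectory()
first = d.delete("T1", "Zebra")            # fails, leaves DirLookup("Zebra")
second = d.insert("T2", "Zebra", "capa")   # conflicts with T1's lookup lock
third = d.insert("T3", "Giraffe", "capa")  # different key, so no conflict
d.end_transaction("T1")
retry = d.insert("T2", "Zebra", "capa")    # T1's lock has been released
```

In this trace T2 is refused while T1 is incomplete, T3 proceeds immediately, and T2 succeeds once T1 has finished, which is the sequence described in the text.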

Although  not  a  formal  proof,  this  example  characterizes  the  set  of  legal  schedules  permitted  by  the 
implementation,  and  shows  how  the  lock  classes,  compatibility  table,  and  locking  protocol  combine  to 
guarantee  that  the  legal  schedules  correspond  to  the  consistent  schedules  defined  in  the  last  section.  They 
capture  the  idea  that,  for  this  abstract  data  type,  synchronization  of  access  depends  on  the  operations  being 
performed,  the  particular  entries  in  the  Directory  they  attempt  to  reference,  and  their  outcome.  Because  locks 
are  on  Directory  objects,  not  components  of  directories,  the  technique  also  handles  phantoms:  entries  that  are 
mentioned  in  operations  but  are  not  present  in  the  Directory.

6.3  Strictly  FIFO  Queues 

Type-specific locking can also be used in implementations of the Queue data type of Section 4.2. As in the
preceding example, assume an idealized implementation operating under conditions of mutual exclusion. To
implement strictly FIFO Queues supporting only QEnter and QRemove operations, two lock classes are
sufficient: {QEnter(σ)} and {QRemove(σ)}. As in the case of Directories, locks on Queues identify the
particular entry to which the operation requesting the lock refers. Since Queue entries are not identified by
key strings, it is assumed that at QEnter time, each element is assigned an identifier unique to the Queue
instance. These identifiers correspond to those used in defining the dependency relations. Thus, a
{QEnter(σ)} lock indicates that an element with identifier σ has been entered into the Queue by an
incomplete transaction. Likewise, a {QRemove(σ)} lock indicates that the element with identifier σ has been
removed from the Queue by an incomplete transaction.

The protocol for the Queue operations is:

•  QEnter operations must obtain a {QEnter(σ)} lock, where σ is the newly-assigned identifier for
the entry to be added. This lock is held until end-of-transaction.

•  QRemove operations must obtain a {QRemove(σ)} lock, where σ is the identifier of the entry at
the head of the Queue. This lock is held until end-of-transaction. Note that obtaining a
{QRemove(σ)} lock does not necessarily imply that an entry σ is actually in the Queue, because
the transaction that made the entry may have since aborted. If so, the QRemove operation must
request a {QRemove(σ')} lock on the new headmost entry, σ'.

Table 2 shows the lock compatibility table for Queues. As usual, the symbols σ and σ' represent the
identifiers of two different elements. Because the element identifiers are unique, certain situations (e.g.,
attempting to enter an element with the same identifier as an element already removed) cannot occur. The
compatibility function is undefined in these cases, so the table entries are marked 'NA' for 'Not Applicable'.


                            Lock Held

Lock Requested      QEnter(σ)       QRemove(σ)

QEnter(σ)           NA              NA
QEnter(σ')          No              OK
QRemove(σ)          No              NA
QRemove(σ')         OK              No

Table 2: Lock Compatibility Table for Queues

The lock compatibility table reflects the limited concurrency of this type. Once a QRemove operation has
retrieved the entry with identifier σ, some entry with identifier σ' becomes the head element of the Queue.
But other transactions will be blocked trying to obtain the {QRemove(σ')} lock needed to remove it, until the
first transaction completes. Multiple QEnter operations on behalf of different transactions interact in the
same way. The incompatibility of {QRemove(σ)} with {QEnter(σ)} ensures that an uncommitted entry
cannot be removed from the Queue, thereby eliminating a potential cause of cascading aborts.
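Because some combinations in Table 2 cannot occur, a direct encoding needs three outcomes rather than two. The sketch below is one hypothetical way to do this: the dictionary name, the string tags for lock classes, and the use of None for 'NA' are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Key: (requested class, held class, same element identifier?)
# Value: True for "OK", False for "No", None for "NA".
QUEUE_TABLE: Dict[Tuple[str, str, bool], Optional[bool]] = {
    ("QEnter", "QEnter", True): None,
    ("QEnter", "QRemove", True): None,
    ("QEnter", "QEnter", False): False,
    ("QEnter", "QRemove", False): True,
    ("QRemove", "QEnter", True): False,
    ("QRemove", "QRemove", True): None,
    ("QRemove", "QEnter", False): True,
    ("QRemove", "QRemove", False): False,
}


def queue_compatible(req_class: str, req_id: int,
                     held_class: str, held_id: int) -> bool:
    """Look up Table 2; raise if the combination is marked NA."""
    verdict = QUEUE_TABLE[(req_class, held_class, req_id == held_id)]
    if verdict is None:
        raise ValueError("combination cannot occur (NA in Table 2)")
    return verdict
```

Raising an error for the 'NA' entries, rather than returning a default, keeps the function from silently defining behavior that the table leaves undefined.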

6.4  WQueues 

For a comparable idealized implementation of WQueues supporting only WQEnter and WQRemove, the
same lock classes may be used as for FIFO Queues. The major difference between the two types shows up in
the lock compatibility function, given by Table 3. To reflect the allowability of interleaved WQEnter
operations by different transactions, the table entry in row 2, column 2 defines {WQEnter(σ)} and
{WQEnter(σ')} locks to be compatible. Similarly, the entry in row 4, column 3 now permits multiple
transactions to perform WQRemove operations. The only remaining restriction is the one in row 3, column 2
that prevents uncommitted entries from being removed. This prevents cycles in the proscribed E(σ) → R(σ)
dependency relation and, because the lock is held until end-of-transaction, also prevents cascading aborts.


                            Lock Held

Lock Requested      WQEnter(σ)      WQRemove(σ)

WQEnter(σ)          NA              NA
WQEnter(σ')         OK              OK
WQRemove(σ)         No              NA
WQRemove(σ')        OK              OK

Table 3: Lock Compatibility Table for WQueues

The locking protocol for the WQueue operations is substantially the same as the one for the Queue
operations. The only difference is that a WQRemove operation that is unable to obtain the required
{WQRemove(σ)} lock on the element at the head of the WQueue does not block. Instead, WQRemove
searches down the WQueue for some other element with identifier σ', for which a {WQRemove(σ')} lock can
be obtained. This reflects the property of WQueues that permits elements farther down the WQueue to be
removed when the head element is uncommitted. If no element can be found, the operation is blocked until
an inserting transaction commits.
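The search performed by WQRemove can be sketched as a scan from the head of the WQueue for the first element whose removal lock could be granted. The representation here is an assumption made for illustration: the queue is a list of identifiers of elements still present, and a dictionary records which incomplete transaction, if any, holds the {WQEnter(σ)} lock on each element. The function name is hypothetical.

```python
from typing import Dict, List, Optional


def wq_remove_candidate(queue: List[int], enter_locks: Dict[int, str],
                        txn: str) -> Optional[int]:
    """Return the identifier of the first element txn may lock for removal.

    queue lists element identifiers from head to tail.
    enter_locks maps an identifier to the incomplete transaction that entered
    it; elements entered by completed transactions have no entry.
    """
    for element in queue:
        owner = enter_locks.get(element)
        # Per Table 3, only another transaction's WQEnter lock on this same
        # element blocks the removal; a transaction's own locks never do.
        if owner is None or owner == txn:
            return element
    return None  # caller must wait until an inserting transaction commits
```

A result of None corresponds to the case in which no element can be found and the operation is blocked.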

6.5  Summary 

The  examples  in  this  section  have  shown  how  type-specific  locking  can  be  used  for  synchronization  in 
implementations  of  several  data  types.  The  examples  show  how  locking  can  be  used  to  prevent  cycles  in 
proscribed  dependency  relations,  including  cycles  containing  several  types  of  objects.  They  also  indicate  how
locking  can  be  used  to  prevent  cascading  aborts. 

A full discussion of the syntax and implementation of type-specific locking mechanisms is beyond the scope
of this paper. Further work is needed to determine the specific primitives required for definition of new
object types, locking, unlocking, conditional locking, etc. Another area requiring further study is the
relationship between the locking mechanism and other synchronization mechanisms that are used for mutual
exclusion and to signal events. It appears, however, that implementation of a type-specific locking
mechanism is often no more complex or expensive than implementations of standard locking. Unlike
predicate locking schemes [Eswaran 76], the set of locks that apply to a particular object can easily be
determined. It is also not difficult to determine what processes may be awakened in response to an event such
as transaction completion.

7  Summary

This paper has been concerned with synchronizing transactions that access shared abstract types. In our
model, four properties distinguish such types from others:


•  Operations  on  them  are  permanent. 

•  They  support  failure  atomicity  of  transactions. 

•  They  do  not  permit  cascading  aborts. 

•  They  contribute  to  preserving  ordering  properties  of  groups  of  transactions. 

These properties are not independent, and the mechanisms that are used to achieve them are therefore related
as well.

Schedules and dependencies are useful in understanding the interaction between concurrent transactions.
The well-known consistency property of serializability can be redefined as a special case of orderability with
respect to a dependency relation. The specific dependency relation depends on how much semantic
knowledge is available concerning operations on objects. When Read operations are distinguished from
Write operations, serializability requires orderability with respect to a less restrictive dependency relation than
when this distinction is not made. Dependencies can also be used to characterize schedules that are not prone
to cascading aborts.

Additional type-specific semantic knowledge about operations can allow additional concurrency. The
interleaving specifications for Directories and Queues developed in Sections 4.1 and 4.2 were stated in terms
of orderability with respect to type-specific dependencies. To increase concurrency further, the WQueue
sacrifices serializability while preserving orderability with respect to a less restrictive dependency. When
several abstract types are combined in a transaction, orderability must be guaranteed with respect to the
relation that is the union of the proscribed relations of the individual types.

Section 6 described a locking mechanism for implementing the synchronization required by the types
described in Section 4. By allowing locks that consist of a type-specific lock class and instance-specific data,
the mechanism provides a powerful framework for using type-specific semantics in synchronization. This
mechanism is suitable for use in transactions containing multiple types, and it can also be used to prevent
cascading aborts. The implementation of Directories shows how type-specific locking permits a uniform
treatment of the problem of phantoms. Locks need not be directly associated with particular components of
objects, which facilitates the separation of synchronization from other type representation issues. The
examples of various Queue types show the mechanism's flexibility.

This paper has not provided a complete discussion of the issues involved in the specification and
implementation of shared abstract types. For example, we have not discussed the construction of compound
shared abstract types, which use other shared abstract types in their implementation. (However, Schwarz
[Schwarz 82] gives an example of this.) In addition, we have hardly mentioned recovery considerations,
though we believe logging mechanisms as described by Lindsay [Lindsay 79] can be extended to meet the
needs of shared abstract types. Recovery is discussed more fully in a related paper [Schwarz 83]. Finally, we
have not discussed specific algorithms for coping with deadlocks.

Clearly, the definition and implementation of shared abstract types is more difficult than the definition and
implementation of regular abstract types. However, once these types are implemented, programmers can
construct arbitrary transactions that invoke operations on the types. These transactions should greatly
simplify the construction of reliable distributed systems. Though this paper has focused entirely on
synchronization, we believe that this topic is central to understanding how transactions can be used as a basic
building block in the implementation of distributed systems.

Abstract

Transactions have proven to be a useful tool for constructing reliable database systems and are likely to be
useful in many types of distributed systems. To exploit transactions in a general purpose distributed system,
each node can execute a transaction kernel that provides services necessary to support transactions at higher
system levels. The transaction model that the kernel supports must permit arbitrary operations on the wide
collection of data types used by programmers. New techniques must be developed for specifying the
synchronization and recovery properties of abstract types that are used in transactions. Existing mechanisms
for synchronization, recovery, deadlock management and communication are often inadequate to implement
these types efficiently, and they must be adapted or replaced.

1.  Introduction

Distributed computing systems are potentially reliable, because the redundancy and autonomy present in
them permit failures to be masked or localized. A major challenge in distributed computing research is to
realize this potential without incurring intolerable penalties in complexity, cost, or performance.
Consequently, there is currently great interest in general-purpose methodologies and practices that simplify
the construction of efficient and robust distributed systems. This paper discusses a methodology based on
transactions and includes a survey of considerations in the design of a transaction kernel: an abstract machine
that supports transactions.

Transactions  were  originally  developed  for  database  management  systems,  to  aid  in  maintaining  arbitrary 
application-dependent  consistency  constraints  on  stored  data.  The  constraints  must  be  maintained  despite 
failures  and  without  unnecessarily  restricting  the  concurrent  processing  of  application  requests. 

In  the  database  literature,  transactions  are  defined  as  arbitrary  collections  of  operations  bracketed  by  two 
markers:  BeginTransaction  and  EndTransaction,  and  have  the  following  special  properties:

•  Either  all  or  none  of  a  transaction's  operations  are  performed.  This  property  is  usually  called 
failure  atomicity. 

•  If  a  transaction  completes  successfully,  the  results  of  its  operations  will  never  subsequently  be  lost.
This  property  is  usually  called  permanence.

•  If  several  transactions  execute  concurrently,  they  affect  the  database  as  if  they  were  executed 
serially  in  some  order.  This  property  is  usually  called  serializability. 

•  An  incomplete  transaction  cannot  reveal  results  to  other  transactions,  in  order  to  prevent 
cascading  aborts  if  the  incomplete  transaction  must  subsequently  be  undone. 

Transactions  lessen  the  burden  on  application  programmers  by  simplifying  the  treatment  of  failures  and 
concurrency.  Failure  atomicity  makes  certain  that  when  a  transaction  is  interrupted  by  a  failure,  its  partial 
results  are  undone.  Programmers  are  therefore  free  to  violate  consistency  constraints  temporarily  during  the
execution  of  a  transaction.  Serializability  ensures  that  other  concurrently  executing  transactions  cannot 
observe  these  inconsistencies.  Prevention  of  cascading  aborts  limits  the  amount  of  effort  required  to  recover 
from  a  failure. 

Database management systems are not the only ones that must assure the consistency of stored data despite
failures and concurrency. Various ad hoc techniques have evolved for this purpose. For example, TOPS-10
[Digital Equipment Corporation 72] and numerous other file systems permit atomic updates to a single disk
file. This technique lacks flexibility and generality, however, and leads to unnecessary restrictions on
concurrency.

Considerable  research  effort  is  currently  being  expended  towards  extending  the  utility  of  transactions 
beyond  database  applications.  At  MIT,  the  Argus  project  [Liskov  82a]  is  adding  transaction  facilities  to  the
CLU  language.  Transactions  will  also  be  available  in  the  Clouds  distributed  operating  system  [Allchin  82].

At Carnegie-Mellon, we are exploring the idea of implementing a transaction kernel on each node of a
distributed system. A transaction kernel is a basic system component that supplies primitives for supporting
transactions and the shared abstract data types on which they operate. Complex, costly, and redundant error
recovery mechanisms could be avoided elsewhere, if this facility were available. A transaction kernel should
also lead to compatible structuring of the various systems that use it, simplifying their interconnection.

This  report  is  an  overview  of  recent  research  on  transaction  systems,  and  surveys  issues  that  arise  in 
developing  a  transaction  kernel.  We  consider  the  extension  of  transactions  to  general  programming  and 
discuss  how  a  transaction  kernel  should  facilitate  data  abstraction.  Subsequent  sections  examine  what  we 
believe  to  be  the  central  issues  in  building  a  transaction  kernel:  synchronizing  access  to  shared  abstract  types 
without  unnecessarily  restricting  concurrency,  managing  deadlocks,  recovering  from  failures,  and 
communicating  efficiently  between  sites.  For  more  on  the  extended  use  of  transactions,  we  refer  the  reader  to 
recent  reports  by  Liskov,  Allchin,  Jacobson,  and  ourselves  [Allchin  82,  Liskov  82b,  Liskov  82a,  Jacobson 
82,  Schwarz  82]. 

2.  Extensions  to  the  Transaction  Model 

A  construct  that  gives  programmers  a  uniform  strategy  for  treatment  of  failures,  controls  interaction 
between  concurrently  executing  processes,  and  ensures  permanence  of  operations  should  simplify  the 
production  of  reliable  distributed  systems.  Except  for  database  applications,  however,  the  utility  of 
transactions  has  not  been  widely  demonstrated.  Lomet  hypothesized  that  transactions  would  be  useful  for
general  programming  [Lomet  77],  but  the  literature  includes  sketches  of  only  a  few  non-database  systems
based  on  transactions  [Liskov  82a,  Allchin  82,  Gifford  79,  Daniels  82].

The traditional transaction model, as described by Gray [Gray 80], was designed primarily for
understanding database management applications. It must be extended to model the additional requirements
imposed by general-purpose distributed systems. For instance, real-time systems may require real-time
synchronization of the participants in transactions (see Section 6). File and mail systems that are both highly
available and highly reliable are also difficult to implement unless constructs not in the traditional transaction
model are used. Their transactions are more complex than those in database systems, and their performance
requirements are potentially higher. Gray also comments on the limitations of the traditional transaction
model [Gray 81a].

Database  systems,  and  their  transaction  mechanisms,  do  not  fully  support  the  abstract  data  types  that  are 
required  in  more  general  systems.  In  a  database,  the  basic  unit  of  information  is  the  typed  record,  which  can 
be  aggregated  into  indexed  files.  The  only  operations  on  records  are  read  and  write,  and  only  operations  such
as  insert,  lookup,  or  sort  are  defined  for  files.

Systems  that  encourage  data  abstraction  must  be  more  flexible.  They  must  permit  the  definition  of 
arbitrary  object  types  with  corresponding  sets  of  type-specific  operations.  They  must  also  allow  new  object 
types  to  be  implemented  by  combining  existing  ones,  and  the  resulting  types  should  appear  to  their  users  as 
primitive  types.  Rather  than  a  sequence  of  reads  and  writes  on  records,  a  transaction  becomes  a  hierarchy  of 
typed  operations  on  objects.  Transactions  can  be  nested,  if  some  of  the  operations  in  the  hierarchy  are
themselves  implemented  with  transactions. 

Nested  transactions  are  also  useful  for  controlling  the  interaction  of  multiple  processes  within  a  single 
transaction,  or  salvaging  partial  results  when  a  transaction  aborts.  For  example,  some  real-time  applications 
employ  fairly  lengthy  transactions.  If  aborting  such  transactions  and  restarting  them  from  the  beginning 
would  cause  intolerable  delays,  the  transactions  must  instead  fall  back  to  intermediate  save  points  [Gray  81c].
See  Reed’s  and  Moss’  theses  for  more  about  nested  transactions  [Reed  78,  Moss  81]. 

In  database  systems,  application  programmers  do  not  have  to  specify  the  consistency  constraints  that  they 
wish  transactions  to  preserve.  By  guaranteeing  serializability  of  all  transactions,  database  transaction 
mechanisms  assure  that  any  consistency  constraint  preserved  when  a  transaction  runs  in  isolation  will  also  be 
preserved  when  transactions  run  concurrently.  The  transaction  manager  must  delay  or  abort  transactions  as 
necessary  to  make  this  guarantee.  If  the  system  were  aware  of  the  specific  consistency  constraints  that 
transactions  were  intended  to  maintain,  it  could  use  this  extra  information  in  deciding  whether  or  not  to  delay 
or  abort  transactions.  Avoiding  unnecessary  delays  and  aborts  would  improve  performance.  Semantic 
knowledge  about  individual  types,  their  operations,  and  their  implementations  could  also  be  used  to  make 
better-informed  decisions  regarding  concurrent  access  to  objects.  A  transaction  mechanism  efficient  enough 
for  use  in  general-purpose  distributed  systems  must  be  flexible  enough  to  allow  such  use  of  semantic 
information  to  achieve  greater  concurrency. 

Our approach is to focus on the individual shared abstract types that programmers use in constructing
transactions.  In  addition  to  the  traditional  properties  of  abstract  types,  these  types  can  be  characterized  by 
their  synchronization  and  recovery  properties.  The  specification  of  these  properties  defines  the  types’  exact 
behavior under conditions of concurrency or failure. Assuming different types cooperate in a reasonable
fashion, the specifications allow programmers to determine whether particular types will meet their needs.
Sections 3 and 5 discuss in detail the synchronization and recovery properties of shared abstract types.

We  have  found  types  with  specialized  synchronization  and  recovery  properties  to  be  useful  in  designing  a 
highly  available  message  system.  For  example,  message  repositories,  which  are  replicated  on  several  sites,  are 
highly  specialized  shared  abstract  objects  with  unique  sets  of  operations.  In  addition  to  reading  and  writing 
messages,  special  operations  permit  out-of-date  message  repositories  to  "catch  up"  with  current  ones.  The 
recovery and synchronization properties of these operations are type-specific and must be carefully specified
and  analyzed. 

3.  Synchronization 

In  a  transaction-based  system,  synchronization  is  important  to  both  the  specification  and  the 
implementation  of  shared  abstract  types.  Traditional  methods  for  synchronizing  access  to  objects  (e.g., 
monitors [Hoare 74]) just prevent concurrent operations on a particular object from interfering with one
another.  Maintaining  consistency  constraints  that  encompass  groups  of  objects  necessitates  additional 
synchronization.  However,  the  mechanisms  that  enforce  this  additional  synchronization  must  not 
unnecessarily  restrict  concurrency.  Because  transactions  are  arbitrarily  large  collections  of  operations,  a 
synchronization  action  that  is  in  force  over  the  entire  scope  of  a  transaction  can  potentially  degrade 
performance  more  severely  than  a  synchronization  action  that  only  affects  a  single  operation. 

One  approach  to  synchronization  in  a  general  transaction-based  system  is  to  classify  each  operation  on  an 
abstract type as either a Read or a Write. A two-phase Read/Write locking scheme [Eswaran 76] ensures
serializability and, if locks are held until end-of-transaction, prevents cascading aborts as well. However, such
techniques  for  managing  concurrency  make  minimal  use  of  semantic  knowledge  about  the  objects  that 
transactions  manipulate,  and  therefore  they  may  prevent  or  delay  operations  unnecessarily. 

For  example,  consider  two  transactions  that  each  insert  a  new  entry  in  a  directory  object.  Since  the 
insertion  operations  modify  the  directory  object,  one  must  classify  them  as  Write  operations.  The  standard 
rules for Read/Write locking prohibit modification of an object by more than one incomplete transaction.
The  system  would  therefore  delay  the  second  insertion  until  the  transaction  making  the  first  one  either 
committed or aborted. Closer examination of the semantics of insertion reveals that this is unnecessary if the
two  insertions  specify  different  keys.  A  synchronization  mechanism  that  could  use  this  extra  knowledge  could 
achieve  greater  concurrency. 

Similarly, specifying serializability as the goal of a transaction synchronization strategy reflects a limited use
of semantic knowledge. Serializability makes sure that any invariant preserved by an individual transaction
will  also  be  preserved  when  transactions  execute  concurrently.  This  guarantee  is  frequently  too  strong.  For 
instance,  consider  a  queue  that  buffers  units  of  work  between  activities  that  produce  and  consume  them. 
Serializing  the  transactions  that  operate  on  the  buffer  queue  groups  together  all  entries  made  by  a  single 
transaction,  in  order  to  enforce  their  consecutive  removal.  In  many  applications,  ordering  of  entries  in  the 
buffer  is  not  crucial  as  long  as  entries  for  which  the  inserting  transaction  has  committed  eventually  reach  the 
head  and  can  be  removed.  Entries  inserted  by  incomplete  transactions  must  not  be  removed,  however,  so  that 
cascading  aborts  cannot  occur.  As  in  the  preceding  example,  using  more  semantic  knowledge  about  the 
object  and  its  intended  purpose  can  lead  to  greater  concurrency. 
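
The buffer-queue behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `WeakQueue` and its methods are hypothetical, transactions are identified by plain strings, and the removing side is not itself treated as a transaction.

```python
class WeakQueue:
    """Queue whose entries become removable only after the inserting
    transaction commits. Entries of one transaction are NOT kept together."""

    def __init__(self):
        self._entries = []        # (txn_id, item) pairs in arrival order
        self._committed = set()   # ids of committed inserting transactions

    def enqueue(self, txn_id, item):
        # Inserts by different incomplete transactions may interleave freely.
        self._entries.append((txn_id, item))

    def commit(self, txn_id):
        self._committed.add(txn_id)

    def abort(self, txn_id):
        # No one can have removed these entries, so no cascading abort occurs.
        self._entries = [e for e in self._entries if e[0] != txn_id]

    def dequeue(self):
        # Return the oldest entry whose inserter has committed, else None.
        for i, (txn_id, item) in enumerate(self._entries):
            if txn_id in self._committed:
                self._entries.pop(i)
                return item
        return None


q = WeakQueue()
q.enqueue("T1", "a")
q.enqueue("T2", "b")
q.enqueue("T1", "c")
q.commit("T2")
first = q.dequeue()    # "b": T2 committed, T1's entries stay hidden
second = q.dequeue()   # None: only uncommitted entries remain
```

Skipping over uncommitted entries is one possible policy; a stricter variant could block until the head entry's transaction completes.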

Many authors [Eswaran 76, Kung 79, Allchin 82, Garcia-Molina 82, Sha 83] have observed that using
semantic knowledge can increase concurrency. While Garcia-Molina and Sha consider the properties of entire
transactions,  we  are  concentrating  on  the  semantics  of  operations  on  individual  types.  To  exploit  this 
approach,  one  must  first  be  able  to  specify  precisely  and  concisely  how  a  type  behaves  under  conditions  of 
concurrent  access  by  multiple  transactions.  Prospective  users  need  such  a  means  of  specification  to  define 
their  own  requirements  and  to  compare  them  with  the  properties  of  available  types.  We  have  investigated 
dependencies  as  a  tool  for  this  purpose.  Dependencies  were  originally  used  in  database  research  for  proving 
the  correctness  of  two-phase  locking  protocols  [Eswaran  76,  Gray  75].  A  dependency  exists  between  any  two 
transactions  that  perform  an  operation  on  a  common  object,  and  the  dependency  defines  the  order  in  which 
the two transactions operate on the object.

One  can  prove  that  if  the  transitive  closure  of  all  the  dependencies  among  transactions  forms  a  partial 
order,  then  the  execution  of  the  transactions  is  serializable  [Eswaran  76].  If  the  transitive  closure  contains 
cycles,  the  ordering  of  transactions  is  ambiguous.  Not  all  dependencies  are  equivalent,  however.  For 
example,  the  semantics  of  the  Read  operation  tell  us  that  the  order  in  which  two  transactions  read  a  common 
object  has  no  effect  on  the  transactions’  outcome.  Even  though  the  transitive  closure  of  all  dependencies  has 
cycles,  disregarding  these  meaningless  dependencies  and  recomputing  the  transitive  closure  may  result  in  a 
partial  order  of  the  transactions.  In  general,  a  group  of  transactions  is  orderable  with  respect  to  a  particular 
group  of  proscribed  dependencies  if  the  transitive  closure  of  the  proscribed  dependencies  yields  a  partial 
order. Serializability in a database with Read/Write locking can be defined in these terms as orderability with
respect to all dependencies except those for which both operations are Reads [Gray 75].
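
As a rough illustration of orderability, the following sketch builds the dependency graph for a recorded history and checks whether the proscribed dependencies contain a cycle. The history format, the function names, and the one-letter operation codes are assumptions made for this example and do not come from the report.

```python
def is_orderable(history, proscribed):
    """history: list of (txn, op, obj) tuples in execution order.
    proscribed(op1, op2): True if a dependency formed by op1 followed by
    op2 on the same object belongs to the proscribed set."""
    edges = {}
    for i, (t1, op1, obj1) in enumerate(history):
        for t2, op2, obj2 in history[i + 1:]:
            if t1 != t2 and obj1 == obj2 and proscribed(op1, op2):
                edges.setdefault(t1, set()).add(t2)

    # Depth-first search; meeting a node that is still "in progress"
    # means the proscribed dependencies form a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def acyclic_from(t):
        color[t] = GREY
        for u in edges.get(t, ()):
            state = color.get(u, WHITE)
            if state == GREY:
                return False
            if state == WHITE and not acyclic_from(u):
                return False
        color[t] = BLACK
        return True

    return all(acyclic_from(t) for t in list(edges)
               if color.get(t, WHITE) == WHITE)


def all_dependencies(op1, op2):
    return True


def read_write(op1, op2):
    # Read/Write rule: only Read-Read dependencies are disregarded.
    return not (op1 == "R" and op2 == "R")


reads = [("T1", "R", "x"), ("T2", "R", "x"), ("T2", "R", "y"), ("T1", "R", "y")]
writes = [(t, "W", obj) for t, _, obj in reads]
```

Here `reads` contains a cycle when every dependency counts, but it is orderable under the Read/Write rule; `writes` is not orderable under either rule.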

In  a  general-purpose  system  with  arbitrary  shared  abstract  types,  a  set  of  proscribed  dependencies  must  be 
defined  for  each  type.  Semantic  knowledge  about  individual  types  can  be  used  in  constructing  this  set,  to 
achieve  high  concurrency  while  still  helping  the  programmer  to  preserve  consistency.  For  instance,  the 
proscribed set of dependencies for directories would not include dependencies between transactions operating
on entries with different keys. Like dependencies in which both operations are Reads, these dependencies
cannot  affect  consistency.  To  specify  a  queue  type  for  which  grouping  of  elements  by  inserting  transaction  is 
not  assured,  dependencies  between  transactions  performing  the  insert  operation  can  be  removed  from  the 
proscribed  dependency  set. 

When  a  transaction  accesses  several  objects  of  different  types,  the  types  must  cooperate  to  maintain  global 
consistency. In addition to guaranteeing orderability with respect to the proscribed dependency sets of the
individual types, the transaction manager must also preserve orderability with respect to the union of the
proscribed  dependency  sets. 

Dependencies  can  also  be  used  to  specify  which  operations  must  be  delayed  to  prevent  potential  cascading 
aborts.  Whenever  a  dependency  is  about  to  form  between  two  incomplete  transactions,  the  second 
transaction  may  have  to  be  delayed  in  case  the  first  one  aborts.  The  decision  whether  or  not  to  delay  depends 
on  the  exact  dependency  being  formed.  Analogous  to  the  proscribed  dependency  set,  each  type  must  specify 
a  deferred  dependency  set  that  determines  the  circumstances  under  which  operations  will  be  delayed  until  a 
prior  transaction  commits  or  aborts.  Usually,  dependencies  that  represent  a  transfer  of  information  between 
the  two  transactions  must  be  deferred.  A  more  extensive  treatment  of  the  dependency  technique,  including 
detailed  examples,  can  be  found  in  a  related  paper  [Schwarz  82]. 

Shared  abstract  types  can  be  divided  into  three  categories  based  on  their  synchronization  behavior.  The 
categories  are  listed  in  order  of  increasing  potential  for  concurrent  access,  and  each  properly  includes  the 
preceding  ones. 

1. Types that serialize access to objects. These types can use semantic knowledge to permit greater
concurrent access to an object without losing the advantages of serializability. The directory that
allows concurrent operations on entries with different keys is in this category. The proscribed
dependency sets for these types include all dependencies that have a detectable effect on
transaction outcomes.

2. Types that do not permit incomplete transactions to reveal their results to other transactions.
Since transactions are not necessarily serializable, this strategy does not guarantee arbitrary
consistency constraints, but can lead to higher concurrency while still preserving properties that
are crucial to the purpose of the type. The queue that does not guarantee grouping by inserting
transaction is in this category. Some dependencies that affect transaction outcomes can be
excluded from these types' proscribed dependency sets.

3. Types with arbitrary synchronization policies. Incomplete transactions that operate on objects of
these types may reveal data to other transactions; it is assumed that these data are acceptable (i.e.,
will not cause cascading aborts) even if the revealing transaction subsequently aborts. An update
that is used only as a "hint" can be revealed, for instance. Even if another transaction reads the
hint while it has an incorrect value, no fatal error will occur. In terms of dependencies, the
deferred dependency sets for these types may exclude some dependencies that transfer
information.

Given  dependencies  as  a  means  of  specification,  a  second  key  to  achieving  efficient  synchronization  by 
utilizing  semantic  knowledge  is  the  definition  of  a  synchronization  mechanism  flexible  enough  to  implement 
a wide variety of shared abstract types. We have examined using type-specific locking as this mechanism.
Bernstein,  Goodman,  and  Lai  [Bernstein  81]  discuss  some  of  this  method’s  basic  principles.  Korth  [Korth  83] 
has  described  a  type-specific  approach  to  locking  based  on  commutativity  of  operations,  which  employs  a 
hierarchy of locks to allow variable-granularity locking. Weihl, in connection with the Argus Project, has
described crowds, an alternative synchronization mechanism for exploiting type-specific semantics [Weihl
81]. Transactions must join a crowd before accessing an object and only leave the crowd when the transaction
is complete. Type-specific rules determine whether a transaction should be admitted to a crowd or be forced to
wait until some other conflicting transaction leaves.

A  set  of  basic  principles  underlies  all  locking  schemes.  Before  a  transaction  manipulates  an  object,  it  must 
obtain a lock on the object. Possession of the lock restricts further access to the object by other transactions,
until  it  is  released.  Locking  mechanisms  thus  control  the  formation  of  dependencies  among  transactions. 
Whenever  one  transaction  waits  for  a  lock  held  by  another,  formation  of  a  dependency  between  the  two 
transactions  is  delayed  until  the  lock  is  released.  The  protocol  for  acquiring  and  releasing  locks  ensures  that  if 
the  dependency  would  become  part  of  a  cycle  in  the  transitive  closure  of  a  set  of  proscribed  dependencies,  a 
deadlock  results  and  the  cycle  never  forms. 

The  simplest  locking  mechanisms  have  only  one  kind  of  lock,  regardless  of  the  type  of  the  object  to  be 
locked  or  the  operation  to  be  performed.  This  form  of  locking  uses  no  semantic  knowledge,  and  cannot 
distinguish between proscribed and non-proscribed dependencies. Many database systems use a locking
mechanism  that  provides  two  lock  classes,  Read  and  Write.  Operations  that  modify  an  object  must  first 
obtain  a  Write  lock,  whereas  operations  that  merely  reference  an  object's  value  need  only  obtain  a  Read  lock. 
The  rules  for  obtaining  locks  specify  that  multiple  transactions  may  simultaneously  hold  Read  locks  on  an 
object,  but  holding  a  Write  lock  reserves  the  object  exclusively  for  one  transaction.  By  making  this  coarse 
distinction among different kinds of operations, Read/Write locking uses limited semantic information to
permit  some  cyclic  dependencies  while  prohibiting  others.  This  yields  greater  concurrency  without 
compromising  consistency. 

Type-specific locking generalizes the ideas behind Read/Write locking. Instead of dividing all operations
into two broad classes, the implementor of each type can define appropriate type-specific lock classes and
associated rules for acquiring and releasing locks. The rules specify the kind(s) of lock required by each of the
type's operations and which kinds of locks are compatible with each other. By tailoring the locking strategy to
suit a specific type and implementation, type-specific locking preserves only what is promised by the type's
specification.  A  large  amount  of  semantic  information  about  both  the  specification  and  the  implementation 
of  the  type  can  be  used  in  deciding  whether  an  operation  must  be  delayed  or  prevented. 

Additional  research  is  needed  to  determine  specific  primitives  for  locking,  unlocking,  definition  of  new 
object  types,  etc.  For  example,  data  stored  in  an  object  or  supplied  as  an  argument  to  an  operation  is 
sometimes  crucial  in  determining  the  compatibility  of  two  operations.  Recall  that  insert  operations  on 
directories are compatible only if they refer to entries with different keys. Type-specific locking primitives
must  permit  the  association  of  auxiliary  information  with  locks  on  objects. 
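
One way to picture auxiliary information attached to locks is a lock table for a directory in which each lock carries the key it refers to. The sketch below is hypothetical: the class `DirectoryLockTable`, the lock modes "lookup" and "insert", and the non-blocking `try_lock` interface are assumptions chosen for brevity, not primitives defined in this report.

```python
class DirectoryLockTable:
    """Type-specific lock table for one directory object.
    Each lock records the key it refers to as auxiliary information."""

    def __init__(self):
        self._held = []   # (txn, mode, key) triples currently held

    @staticmethod
    def _compatible(mode_a, key_a, mode_b, key_b):
        if key_a != key_b:
            return True   # operations on different keys never conflict
        # Same key: only two lookups may proceed together.
        return mode_a == "lookup" and mode_b == "lookup"

    def try_lock(self, txn, mode, key):
        """Grant the lock and return True, or return False if the caller
        must wait for a conflicting transaction to finish."""
        for holder, held_mode, held_key in self._held:
            if holder != txn and not self._compatible(held_mode, held_key, mode, key):
                return False
        self._held.append((txn, mode, key))
        return True

    def release_all(self, txn):
        # Called at end-of-transaction (commit or abort).
        self._held = [h for h in self._held if h[0] != txn]


locks = DirectoryLockTable()
granted_a = locks.try_lock("T1", "insert", "alpha")   # True
granted_b = locks.try_lock("T2", "insert", "beta")    # True: different key
blocked = locks.try_lock("T2", "insert", "alpha")     # False: same key as T1
```

A plain Read/Write scheme would have refused the second insert outright, since both inserts modify the directory.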

A related paper by Schwarz and Spector [Schwarz 82] contains further details and examples of type-specific
locking. It appears that implementations of type-specific locking mechanisms will be reasonably simple and,
in order to understand their details more completely, we are building one using directories as a sample shared
abstract  type. 

4.  Deadlock 

One  must  consider  the  possibility  of  deadlock  in  any  system  where  processes  may  wait  for  dynamically 
allocated resources. In a transaction, the resources are the objects that the transaction accesses. There are
many strategies for coping with deadlocks, but it is not clear which are most appropriate for transaction-based
systems  with  arbitrary  shared  abstract  types. 

One  approach  is  to  impose  a  global  ordering  on  all  system  resources,  and  force  all  transactions  to  obtain 
resources  according  to  this  ordering.  This  method  is  unsuitable,  because  it  does  not  allow  transactions  that 
access a data-dependent collection of objects. When the system initiates a transaction, it must know a priori
all  the  resources  the  transaction  will  need.  Another  technique  uses  timestamps  on  transactions  or  objects  to 
avoid  deadlock  [Rosenkrantz  78,  Reed  78].  A  third  approach  to  the  problem  is  to  allow  deadlocks, 
subsequently detect them, and ultimately resolve them by selecting a transaction to abort. Either timeouts or
an  algorithm  that  analyzes  waiting  transactions  can  be  used  for  detection.  Unfortunately,  employing  timeouts 
causes  the  timing  behavior  of  an  abstract  data  type's  implementation  to  become  a  critical  aspect  of  the  type's 
specification. In either case, detection and resolution of deadlocks could become a bottleneck that would
constrain  performance. 
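
Detection by analyzing waiting transactions is commonly done by looking for a cycle in a waits-for graph. The following is a minimal sketch of that idea, assuming the graph is available as a dictionary; the function name and data layout are illustrative, and choosing which transaction in the cycle to abort is left out.

```python
def find_deadlock(waits_for):
    """waits_for maps each transaction to the set of transactions it is
    waiting on. Returns a list of transactions forming a cycle, or None."""
    path = []        # transactions on the current depth-first path
    done = set()     # transactions known not to lead to a cycle

    def visit(t):
        if t in done:
            return None
        if t in path:
            return path[path.index(t):]   # the cycle closes at t
        path.append(t)
        for u in waits_for.get(t, ()):
            cycle = visit(u)
            if cycle:
                return cycle
        path.pop()
        done.add(t)
        return None

    for t in list(waits_for):
        cycle = visit(t)
        if cycle:
            return cycle
    return None


no_cycle = find_deadlock({"T1": {"T2"}, "T2": set()})
cycle = find_deadlock({"T0": {"T1"}, "T1": {"T2"}, "T2": {"T1"}})
```

In the second call `T0` waits on a deadlocked pair but is not itself part of the cycle, so only `T1` and `T2` are reported as candidates for abort.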

Arbitrary type-specific locking protocols can cause another problem. If a protocol allows the release of
some locks prior to end-of-transaction, it may be necessary to re-acquire them later to process an abort.
Re-acquisition  violates  the  common  simplifying  assumption  that  aborting  a  transaction  never  requires 
additional  resources.  Deadlocks  can  therefore  occur  during  abort  processing,  and  the  standard  approach  of 
aborting  a  waiting  transaction  cannot  resolve  them. 

There  has  been  fairly  little  formal  analysis  of  the  relationships  between  the  probability  of  waiting  or 
deadlock  and  such  factors  as  degree  of  multi-programming,  number  of  operations  in  a  transaction,  and  size  of 
the shared database. However, Gray et al. and Lin et al. [Gray 81b, Lin 82] have each modeled both the
probability of waiting for a lock request and the probability of deadlock in two-phase locking protocols, and
they  conclude  that  both  probabilities  rise  with  the  degree  of  multiprogramming.  They  also  report  that  the 
probabilities  of  deadlock  and  waiting  rise  more  than  linearly  in  the  number  of  operations  per  transaction. 
These pessimistic conclusions are based on very simple models. They must be adapted if they are to represent
accurately  the  behavior  of  transactions  that  access  a  hierarchically  structured  graph  of  typed  objects.  It  seems 
reasonable,  however,  to  conclude  that  if  many  transactions  frequently  access  small  groups  of  objects, 
contention  and  deadlock  would  become  serious  problems. 

To summarize, the problems of deadlock are exacerbated in general-purpose transaction-based systems.
Further research is needed to examine the applicability of traditional solutions, and to determine the tradeoffs
among those solutions in this environment. This research may yield variations on the traditional solutions, or
demonstrate the need for new algorithms specifically designed for shared abstract types. For an example of a
new approach to deadlock avoidance, see Korth’s hierarchical variable-granularity locking protocol [Korth
81],  which  uses  edge  locks. 

5.  Recovery 

Recovery  is  the  process  of  restoring  consistency  after  a  failure.  Recovery  properties  can  be  used  like 
synchronization  properties  to  classify  types,  and  different  recovery  techniques  are  appropriate  for  different 
classes  of  types: 

Some types have operations that are uninvertible. Gray has called such types real [Gray 80], because their
operations correspond to events in the "real" world that are either unrepeatable or irreversible. An operation
that causes a banking terminal to dispense cash is an example of an uninvertible update. These operations
must  be  deferred  until  the  invoking  transaction  commits. 

Other  types  can  be  characterized  by  two  properties  of  their  operations:  failure  atomicity  and  permanence. 
Failure-atomic operations are always undone upon transaction abort, and if all operations in a transaction are
failure-atomic then the entire transaction will be failure-atomic. Failure-atomic operations must be undone
both when transactions abort during normal processing and when transactions are interrupted by failures.
After a failure, recovery must identify and then abort any transactions that were in progress. Operations that
are not failure-atomic are useful for implementing hints efficiently. As discussed in Section 3, incorrect hints
do  not  cause  fatal  errors  or  loss  of  consistency. 

Permanent  operations  are  never  undone  once  a  transaction  has  committed.  Guaranteeing  permanence, 
unlike failure atomicity, requires that the system store some information in a failure-resilient manner. This is
potentially expensive, and there are many types that do not need to survive failures. The cost of
reconstructing an object’s state from other information after a failure can be less than the continued cost of
ensuring permanence for each operation. Operations that are non-permanent but failure-atomic are useful
for  preserving  consistency  of  objects  that  can  be  discarded  after  failures,  but  should  remain  consistent  when 
aborts  occur  during  normal  processing. 

Underlying  any  recovery  mechanism  is  an  abstract  model  for  failures.  Lampson  has  developed  a  model 
that distinguishes between two kinds of failures: errors and disasters [Lampson 81]. Under this model, one of
the purposes of recovery is to mask the undesirable properties of real system components by providing new,
better-behaved abstract components. These stable components function identically to their real counterparts,
except that they are not subject to errors. However, stable components remain vulnerable to disasters. By
distinguishing  between  these  two  kinds  of  incorrect  behavior,  the  model  encourages  a  clear  delineation  of  the 
failures  that  recovery  must  handle  successfully. 

For example, reading or writing detectably incorrect data is a storage error, as is media failure: the
"infrequent" spontaneous decay of correct data. However, reading or writing undetectably corrupted data is a
storage disaster. Unlike real storage, stable storage always reads and writes data correctly unless a disaster
happens. There are several ways to implement stable storage, including duplexed disk or error-correcting
RAM  with  a  backup  power  source. 
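
The duplexed approach can be illustrated with a toy page that keeps two checksummed copies. This is a sketch under assumptions rather than a description of any particular implementation: the class `StablePage`, the use of CRC-32 as the error detector, and the in-memory "copies" standing in for two disks are all choices made for the example.

```python
import zlib


class StablePage:
    """One page of stable storage kept as two checksummed copies."""

    def __init__(self):
        self.copies = [None, None]   # each copy is (data, checksum) or None

    @staticmethod
    def _good(copy):
        # A copy whose checksum does not match is a detectable storage error.
        return copy is not None and zlib.crc32(copy[0]) == copy[1]

    def write(self, data: bytes):
        record = (data, zlib.crc32(data))
        # On real hardware the first write would complete before the second
        # begins, so a crash can damage at most one copy.
        self.copies[0] = record
        self.copies[1] = record

    def read(self) -> bytes:
        for copy in self.copies:
            if self._good(copy):
                return copy[0]
        raise IOError("disaster: no good copy remains")

    def recover(self):
        # After a crash or decay, make the two copies agree again.
        first, second = self.copies
        if self._good(first):
            self.copies[1] = first
        elif self._good(second):
            self.copies[0] = second


page = StablePage()
page.write(b"v1")
# Simulate decay of the first copy: the data changes, the checksum does not.
page.copies[0] = (b"vX", page.copies[0][1])
value = page.read()   # still b"v1", served from the second copy
page.recover()
```

Corruption that happens to leave the checksum valid would go unnoticed, which corresponds to a disaster in the model above.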

Incorrect  behavior  by  processors  can  similarly  be  classified  as  erroneous  or  disastrous.  If  a  processor 
detects an inconsistency and "crashes" by resetting itself and the system’s volatile memory to a standard state,
the  behavior  is  considered  to  be  an  error.  If  an  inconsistency  slips  by  undetected,  then  a  disaster  has  taken 
place.  Stable  processors  that  recover  from  crashes  can  be  built  using  stable  storage  to  save  processor  state. 

Stable  storage  gives  programmers  the  ability  to  make  atomic  modifications  to  disk  pages  or  other  small, 
fixed-size  units  of  data.  To  provide  types  with  failure-atomic  or  permanent  operations,  the  properties  of 
stable  storage  must  be  used  to  implement  atomic  modification  of  arbitrary  collections  of  data.  Database 
systems frequently use logging [Gray 78, Gray 81c, Lindsay 79] to achieve failure atomicity and permanence
of transactions. We will briefly summarize this technique and consider its suitability for implementing shared
abstract  types  with  these  properties. 

Unlike "shadow" techniques [Lorie 77, Lampson 81], in which transactions manipulate temporary copies of
objects, logging allows transactions to modify objects in place. Furthermore, objects can be transferred
between volatile storage (which does not survive processor errors) and non-volatile storage in a way that is
independent of transaction commitment. Thirdly, when logging is used, objects themselves do not have to be
stored in stable storage. To permit the restoration of consistency if a failure occurs, transactions append
information to a log in stable storage as they execute. Because objects are modified in place, the following
types  of  inconsistency  can  be  present  after  a  failure: 

• Some objects that committed transactions have modified may not have been copied to non-volatile
storage prior to the failure. The log must contain sufficient information to redo those
modifications during recovery.

• Some objects that incomplete (aborted) transactions have modified may have been copied to
non-volatile storage prior to the failure. The log must contain sufficient information to undo
those modifications during recovery.

• A media failure may detectably damage the most recent copy of an object on non-volatile storage.
The log must have sufficient information to restore the object’s current state from an archived
version.

Output  of  the  log  to  stable  storage  must  be  coordinated  with  the  commitment  of  transactions  and  with  the 
movement  of  objects  between  volatile  and  non-volatile  storage.  A  transaction  may  not  commit  until  the 
information  needed  to  redo  its  modifications  has  been  written  to  the  log.  Likewise,  a  modified  object  cannot 
be  migrated  to  non-volatile  storage  before  the  information  necessary  to  undo  the  change  has  been  recorded  in 
the  log.  This  tactic  is  often  referred  to  as  the  Write  Ahead  Log  protocol  [Gray  78]. 
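
The two ordering rules of the Write Ahead Log protocol can be made concrete with a small in-memory model. Everything here is an illustrative assumption: the classes `WriteAheadLog` and `Store`, the record layout, and the use of a record count as a log sequence number are invented for the sketch.

```python
class WriteAheadLog:
    def __init__(self):
        self.stable_log = []   # records already forced to stable storage
        self.buffer = []       # records still in volatile memory

    def append(self, record):
        self.buffer.append(record)
        # Sequence number of this record (position in the whole log).
        return len(self.stable_log) + len(self.buffer)

    def force(self):
        self.stable_log.extend(self.buffer)
        self.buffer.clear()

    def stable_lsn(self):
        return len(self.stable_log)


class Store:
    def __init__(self, log):
        self.log = log
        self.volatile = {}      # name -> (value, sequence number of last update)
        self.nonvolatile = {}   # name -> value

    def update(self, txn, name, new):
        old = self.volatile.get(name, (None, 0))[0]
        lsn = self.log.append((txn, name, old, new))   # old and new values
        self.volatile[name] = (new, lsn)               # modify in place

    def commit(self, txn):
        self.log.append((txn, "commit"))
        self.log.force()   # rule 1: redo information is stable before commit

    def migrate(self, name):
        value, lsn = self.volatile[name]
        if lsn > self.log.stable_lsn():
            self.log.force()   # rule 2: undo information reaches the log first
        self.nonvolatile[name] = value


log = WriteAheadLog()
store = Store(log)
store.update("T1", "x", 5)
store.migrate("x")      # forces the log record for x before copying x out
store.update("T1", "y", 7)
store.commit("T1")      # forces the remaining records, then commits
```

Forcing the whole buffer on each migration is the simplest correct choice for a sketch; a real system would likely batch these writes.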

There are many ways to represent the required information in the log, but they all have one aspect in
common.  By  definition,  a  log  is  a  linear  sequence  of  typed  records  that  can  only  be  modified  by  appending 
new  records  at  the  end.  Log  records  can  be  read  in  any  order. 

Perhaps  the  simplest  way  to  represent  log  information  is  by  recording  the  old  and  new  values  of  modified 
objects [Lindsay 79]. Old values can be used to undo aborted transactions; new values can be used to redo
committed transactions. The limitations of this representation technique come from its close relation to
synchronization policy. If the synchronization rules for an object permit concurrent modification by more
than one incomplete transaction, it is frequently impossible to use the old value/new value log representation.
This limitation also applies to recovery techniques based on "shadow" copies.

An  abstract  type  that  implements  a  counter  provides  a  simple  example  of  this  limitation.  The  abstract 
properties  of  a  counter  do  not  prohibit  concurrent  increment  operations  by  multiple  transactions,  as  long  as 
the increment operation does not also return the counter’s value. Suppose the counter has an initial value of
0.  The  first  increment  operation  records  an  old  value  of  0  in  the  log.  The  second  transaction  records  an  old 
value  of  1,  setting  the  current  value  to  2.  If  the  second  transaction  commits  but  the  first  transaction  later 
aborts,  restoring  the  first  transaction’s  old  value  of  0  is  incorrect. 
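
The counter scenario can be replayed directly. The class below is a hypothetical model of old value/new value logging written for this example; it deliberately reproduces the faulty behavior.

```python
class OldValueCounter:
    """Counter recovered by restoring logged old values (deliberately naive)."""

    def __init__(self):
        self.value = 0
        self.log = []   # (txn, old_value, new_value)

    def increment(self, txn):
        self.log.append((txn, self.value, self.value + 1))
        self.value += 1

    def abort(self, txn):
        # Undo by restoring the old value recorded by each of txn's updates.
        for t, old, _new in reversed(self.log):
            if t == txn:
                self.value = old


c = OldValueCounter()
c.increment("T1")   # logs old value 0
c.increment("T2")   # logs old value 1; the counter is now 2
c.abort("T1")       # restores 0, although T2's increment should leave 1
```

After the abort `c.value` is 0, so the effect of the second transaction's increment has been lost.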

The  principles  behind  this  argument  can  be  formalized,  and  rigorous  criteria  for  the  applicability  of  this  log 
representation  can  be  specified.  Our  investigation  thus  far  of  synchronization  for  shared  abstract  types  has 
indicated  that  there  is  a  lot  of  concurrency  to  exploit  without  violating  reasonable  type-specific 
synchronization  properties,  and  synchronization  policies  that  take  advantage  of  this  concurrency  will  not 
always  be  compatible  with  old  value/new  value  logging. 

A  second  logging  technique  is  based  on  recording  transitions  rather  than  old  or  new  states  [Gray  81c]. 
Appropriate  inverse  transitions  can  correctly  and  independently  abort  forward  transitions.  For  the  counter, 
transition logging records "Increment" for each transaction rather than the counter value. In this case, the
inverse  operation  is  to  decrement  the  counter. 
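
A corresponding sketch of transition logging for the same counter follows. The class name and the form of the log records are assumptions made for this example.

```python
class TransitionCounter:
    """Counter recovered by applying inverse transitions."""

    def __init__(self):
        self.value = 0
        self.log = []   # append-only list of (txn, operation) records

    def increment(self, txn):
        self.log.append((txn, "increment"))
        self.value += 1

    def abort(self, txn):
        # Apply the inverse (a decrement) once for each logged increment.
        for t, op in reversed(self.log):
            if t == txn and op == "increment":
                self.value -= 1
        self.log.append((txn, "abort"))


c = TransitionCounter()
c.increment("T1")
c.increment("T2")
c.abort("T1")   # decrements once; T2's increment survives and the value is 1
```

Because each inverse depends only on the operation and not on the counter's value at the time, aborts of concurrent incrementers do not interfere with one another.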

The  limitations  of  the  transition  method  come  from  the  difficulty  of  constructing  types  with  operations  that 
are practical to invert. Sometimes it is difficult to know at the time the log record is written exactly what
information will be needed to invert the operation later on. For instance, suppose the counter also offers a
reset operation. If a reset occurs and later is aborted, the proper restored value for the counter depends not
only  on  its  value  at  the  time  of  the  reset,  but  also  on  the  operations  that  have  occurred  since.  Examining  the 
log  and  redoing  these  intervening  operations  may  be  prohibitively  expensive. 

The cost and complexity of logging depend on a type’s implementation as well as on its abstract properties.
For  instance,  the  logging  algorithm  for  a  set  implemented  as  a  bit  vector  is  quite  different  from  the  logging 
algorithm  for  a  linked-list  implementation.  Further  research  is  needed  to  evaluate  the  power  of  existing 
algorithms.  This  research  should  lead  to  new  or  modified  logging  techniques  that  support  recovery  for  a 
variety  of  types  and  implementation  strategies. 

The  composability  of  types  also  complicates  recovery.  In  a  database,  the  records  at  the  leaves  of  the 
hierarchy  are  critical.  Files  and  indices  serve  only  to  organize  this  data,  and  their  function  is  explicitly 
understood by the system. It is therefore appropriate to provide recovery facilities at the record level; the
system can automatically correct any related file or index structures. In a system allowing general shared
abstract types, it is more difficult to decide which operations should be permanent or failure-atomic and
which should not. If one type is used in the implementation of another, the recovery behavior of the
component type may not be appropriate in the larger context. As with synchronization properties, it is necessary
to include recovery properties in the abstract specification for a type.

6.  Communication 

Communication  systems  aim  to  provide  useful  and  efficient  communication  primitives.  Though  these 
goals  are  easy  to  state,  individual  communications  systems  attempt  to  meet  them  in  different  ways.  The 
communication  mechanism  of  a  transaction-based  system  is  used  both  for  the  inter-node  operation  calls  that 
occur  within  transactions  as  well  as  for  transaction  management  operations  themselves.  The  latter  group 
includes  transaction  initiation,  transaction  migration,  commit  coordination,  and  distributed  deadlock 
detection.  Though  much  is  known  about  communication  in  transaction-based  distributed  databases  [Lindsay 
79,  Gray  78],  more  general  transaction-based  systems  have  additional  communication  requirements  and  their
communication  systems  must  be  the  subject  of  more  study. 

The  foremost  of  these  requirements  is  high  communication  efficiency.  General  distributed  systems  may 
contain  many  brief  transactions  that  execute  frequently.  In  current  distributed  database  systems,  transactions 
last  at  least  a  few  hundred  milliseconds,  because  they  perform  reads  or  writes  to  secondary  storage. 
Performance  of  the  communication  system  is  therefore  not  critical.  General  distributed  systems,  however, 
will  use  new  types  of  low-latency  stable  storage,  and  very  efficient  communication  is  likely  to  be  important.
More  frequent  distributed  deadlock  detection  may  also  be  necessary,  especially  in  real-time  systems. 

High  availability  also  demands  high  communication  efficiency.  For  example,  frequent  operations  across 
node  boundaries  are  required  to  maintain  many  data  replicas.  Communication  efficiency  can  be  increased  by 
simplifying  protocols,  reducing  cross-level  context  switching,  and  increasing  hardware  support  for  the 
communication  system.  Communication  primitives  and  their  implementations  must  take  advantage  of  the 
properties  of  the  underlying  communication  media  and  not  rely  on  excessive  protocol  layering  [Spector  82]. 

For  instance,  consider  remote  operation  calls  on  a  network.  Assume  that  the  network’s  error  rate  is  low  in 
comparison  with  the  rate  of  occurrence  of  other  errors  such  as  deadlock.  Though  remote  call  primitives  could 
be  implemented  with  complex  error-correction  facilities,  it  is  only  necessary  that  these  primitives  have 
at-most-once  semantics.  That  is,  the  communication  system  must  prevent  duplicated,  corrupted,  or  out-of-
order  operation  calls,  but  it  need  not  guarantee  that  remote  operations  are  actually  executed  [Liskov  82b].  If
the  communication  medium  is  a  typical  local  area  network,  those  semantics  can  be  provided  efficiently.  It  is
deliberately  left  to  the  transaction  manager  to  detect  and  recover  from  other  communication  errors  (e.g.,  lost
messages)  by  causing  a  transaction  abort.  This  is  an  application  of  Saltzer's  "end-to-end"  argument  [Saltzer  81].
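
A receiver offering at-most-once semantics might look like the sketch below. It is one possible design under stated assumptions, not the mechanism of any cited system: the class name, the per-sender sequence numbers, and the CRC-32 checksum are all hypothetical choices made for illustration.

```python
import zlib

# Hypothetical sketch of an at-most-once receiver. It silently drops
# duplicated, corrupted, and out-of-order calls and never asks for a
# retransmission; lost calls are left for the transaction manager to
# notice (for example by timeout) and handle by aborting the transaction.


class AtMostOnceReceiver:
    def __init__(self, execute):
        self.execute = execute   # callback that performs the remote operation
        self.next_seq = {}       # sender -> lowest sequence number still acceptable

    def receive(self, sender, seq, payload, checksum):
        if zlib.crc32(payload) != checksum:
            return False         # corrupted: drop
        if seq < self.next_seq.get(sender, 0):
            return False         # duplicate or late arrival: drop
        # A gap (seq above the expected number) means earlier calls were
        # lost. They will be rejected if they arrive later, so nothing is
        # ever executed twice or out of order.
        self.next_seq[sender] = seq + 1
        self.execute(payload)
        return True


executed = []
rx = AtMostOnceReceiver(executed.append)
results = [
    rx.receive("nodeA", 0, b"op0", zlib.crc32(b"op0")),      # accepted
    rx.receive("nodeA", 0, b"op0", zlib.crc32(b"op0")),      # duplicate
    rx.receive("nodeA", 2, b"op2", zlib.crc32(b"op2")),      # accepted, op1 lost
    rx.receive("nodeA", 1, b"op1", zlib.crc32(b"op1")),      # arrives too late
    rx.receive("nodeA", 3, b"op3", zlib.crc32(b"op3") ^ 1),  # bad checksum
]
print(results)
print(executed)
```

The receiver keeps only one integer per sender and sends no acknowledgments, which is what keeps the primitive cheap on a low-error-rate local network.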

Other  optimizations  to  transaction  communication  facilities  include  batching  transmissions  and  using 
multicast.  Batching  can  be  used  to  transmit  a  group  of  updates  that  were  deferred  until  commit  time.
Multicast  can  be  used  for  the  transmission  of  similar  remote  operations  to  multiple  sites  [Rowe  79],  for 
example,  when  transactions  access  replicated  data.  For  sufficiently  reliable  communications  media,  multicast 
messages  can  be  sent  without  requesting  acknowledgments.  The  error  recovery  facility  of  the  transaction 
manager  is  responsible  for  recovering  from  communication  errors. 
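
Batching of deferred updates can be sketched as follows. The class and the send callback are hypothetical stand-ins for whatever transport a real transaction facility would use; the sketch shows only that updates accumulate locally and leave as one message per site at commit time.

```python
from collections import defaultdict

# Hypothetical sketch: updates are deferred during the transaction and
# transmitted as a single batch per remote site at commit time.


class BatchingTransaction:
    def __init__(self, send):
        self.send = send                    # send(site, updates): one message
        self.deferred = defaultdict(list)   # site -> updates not yet sent

    def update(self, site, key, value):
        # No network traffic yet; just remember the update.
        self.deferred[site].append((key, value))

    def commit(self):
        # One transmission per site, regardless of how many updates it holds.
        for site, updates in self.deferred.items():
            self.send(site, updates)
        sent = len(self.deferred)
        self.deferred.clear()
        return sent


messages = []
txn = BatchingTransaction(lambda site, updates: messages.append((site, updates)))
txn.update("nodeA", "x", 1)
txn.update("nodeB", "y", 2)
txn.update("nodeA", "z", 3)
message_count = txn.commit()
print(message_count)   # 2 messages carry 3 updates
```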

Though  eliminating  functions  from  intermediate  protocol  levels  can  improve  efficiency,  there  are  some 
problems  to  consider.  For  example,  flow  control  and  security  are  often  functions  of  intermediate-level 
protocols,  and,  when  required,  must  instead  be  added  to  the  high-level  transaction  protocols.  Additionally, 
reflecting  many  communication  errors  back  to  the  transaction  manager  can  actually  result  in  lower 
performance  if  relatively  unreliable  communication  media  are  used. 

Beyond  added  communication  efficiency,  more  demanding  transaction  management  operations  may
induce  other  new  requirements.  The  transaction  coordinator  may  require  that  the  various  nodes  participating 
in  a  transaction  agree  to  commit  their  operations  cotemporally.  This  problem  is  relevant  in  real-time 
transaction  processing  where  transactions  must  simultaneously  activate  several  devices. 

The  cotemporal  commit  problem  is  described  by  Gray  as  the  problem  of  N  generals  trying  to  agree,  via 
exchange  of  messages  along  an  unreliable  path,  on  a  time  for  simultaneous  attack  [Gray  78].  Protocols  that
can  be  used  to  solve  it  are  analogous  to  2-phase  commit  protocols  but  with  an  added  constraint  concerning  the 
time  the  participants  actually  carry  out  the  commit  operation.  For  a  communication  medium  that  can  lose 
messages,  there  is  no  protocol  that  guarantees  that  the  participants  will  agree  to  commit  cotemporally. 
However,  protocols  similar  to  the  centralized  2-phase  commit  protocols  are  better  than  ones  similar  to  the
linear  commit  protocol,  because  the  centralized  protocol  permits  the  parallel  transmission  of  messages  to  the
participants.  This  increased  parallelism  reduces  the  interval  during  which  some  participants  may  have  agreed 
to  commit  at  a  certain  time,  whereas  others  have  not  yet  been  so  informed. 
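
The effect of parallel transmission on this interval can be illustrated with a simple timing model. The model is an assumption made for the sketch: each message has a fixed delay, no messages are lost, and the function names and delay values are hypothetical.

```python
# Hypothetical timing model: when does each participant learn the agreed
# commit time? delays[i] is the delay of the message that reaches
# participant i (arbitrary time units).

def inform_times_centralized(delays):
    # The coordinator sends to every participant in parallel at time 0.
    return list(delays)


def inform_times_linear(delays):
    # Each participant forwards the decision to the next one in a chain.
    times, elapsed = [], 0
    for delay in delays:
        elapsed += delay
        times.append(elapsed)
    return times


def exposure_window(times):
    # Interval during which some participants know the commit time
    # while others have not yet been informed.
    return max(times) - min(times)


delays = [2, 3, 2, 4]
centralized = exposure_window(inform_times_centralized(delays))
linear = exposure_window(inform_times_linear(delays))
print(centralized, linear)   # 2 9
```

Under this model the centralized window is bounded by the spread of individual message delays, while the linear window grows with the number of participants.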

Increased  efficiency  and  cotemporal  transaction  commit  are  only  two  examples  of  requirements  for
communication  systems  that  support  general  transaction  mechanisms.  Though  such  requirements  are  similar
to  those  of  distributed  database  systems,  there  are  differences  that  must  be  studied  further.

7.  Summary

The  goal  of  constructing  a  transaction  kernel  is  to  make  transactions  available  as  a  fundamental 
programming  construct  for  reliable  distributed  computing,  thereby  reducing  the  complexity  of  designing  and 
implementing  reliable  distributed  systems.  There  is  considerable  evidence  that  transactions  free  the 
programmer  from  continual  reimplementation  of  complex  synchronization  and  recovery  code,  and  that  they
will  be  useful  in  the  construction  of  distributed  systems.  This  paper  has  suggested  the  possibility  of  building  a 
transaction  kernel  to  support  transactions  containing  calls  on  user-definable  shared  abstract  data  types.  It  has 
also  described  important  research  questions,  such  as  what  modifications  to  the  traditional  transactional  model 
will  be  necessary,  and  whether  systems  built  using  transactions  will  have  acceptable  efficiency. 

We  are  attempting  to  answer  these  questions  at  Carnegie-Mellon,  in  an  effort  that  overlaps  with  the
Archons  project  and  currently  includes  five  researchers.  Specifically,  we  are  pursuing  research  on  the  topics
indicated  by  the  major  section  headings  of  this  paper:

•  Extensions  to  the  transaction  model  and  the  overall  structuring  of  distributed  systems  that  utilize 
transactions,  including  the  identification  of  useful  shared  abstract  data  types. 

•  The  specification  and  lock-based  implementation  of  synchronization  for  shared  abstract  types.

•  The  impact  of  high-concurrency  shared  abstract  types  on  deadlock  detection  and  resolution 
algorithms. 

•  The  specification  and  implementation  of  recovery  for  shared  abstract  types. 

•  The  fulfillment  of  communication  requirements  for  systems  utilizing  transactions. 

There  are  other  issues  concerning  the  general  use  of  transactions,  but  this  subset  forms  a  good  basis  for 
research  on  extending  their  utility.  We  are  not  considering  deadlock  avoidance  mechanisms,  alternatives  to 
lock-based  synchronization,  or  incorporation  of  transactions  into  programming  languages.  Work  that 
overlaps  ours  and  also  addresses  some  of  these  other  topics  is  occurring  elsewhere  [Liskov  82b,  Allchin  82].
When  the  results  of  present  research  on  transactions  become  available,  it  should  be  possible  to  construct  a 
transaction  kernel  that  encourages  more  universal  use  of  transaction-based  programming. 